Nine Algorithms That
Changed the Future
THE INGENIOUS IDEAS THAT
DRIVE TODAY'S COMPUTERS
John MacCormick
PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD
Copyright © 2012 by Princeton University Press
Published by Princeton University Press,
41 William Street, Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press,
6 Oxford Street, Woodstock, Oxfordshire OX20 1TW
All Rights Reserved
Library of Congress Cataloging-in-Publication Data
MacCormick, John, 1972–
Nine algorithms that changed the future : the ingenious ideas that drive
today's computers / John MacCormick.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-691-14714-7 (hardcover : alk. paper)
1. Computer science. 2. Computer algorithms.
3. Artificial intelligence. I. Title.
QA76M21453 2012
006.3-dc22 2011008867
A catalogue record for this book is available from the British Library
This book has been composed in Lucida using TeX
Typeset by T&T Productions Ltd, London
Printed on acid-free paper
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
The world has arrived at an age of cheap complex devices of
great reliability; and something is bound to come of it.
—Vannevar Bush, “As We May Think,” 1945
CONTENTS
1. Introduction: What Are the Extraordinary Ideas Computers Use Every Day?
2. Search Engine Indexing: Finding Needles in the World's Biggest Haystack
3. PageRank: The Technology That Launched Google
4. Public Key Cryptography: Sending Secrets on a Postcard
5. Error-Correcting Codes: Mistakes That Fix Themselves
6. Pattern Recognition: Learning from Experience
7. Data Compression: Something for Nothing
8. Databases: The Quest for Consistency
9. Digital Signatures: Who Really Wrote This Software?
FOREWORD
Computing is transforming our society in ways that are as profound as the changes wrought by physics and chemistry in the previous two centuries. Indeed, there is hardly an aspect of our lives that hasn't already been influenced, or even revolutionized, by digital technology. Given the importance of computing to modern society, it is therefore somewhat paradoxical that there is so little awareness of the fundamental concepts that make it all possible. The study of these concepts lies at the heart of computer science, and this new book by MacCormick is one of the relatively few to present them to a general audience.
One reason for the relative lack of appreciation of computer science as a discipline is that it is rarely taught in high school. While an introduction to subjects such as physics and chemistry is generally considered mandatory, it is often only at the college or university level that computer science can be studied in its own right. Furthermore, what is often taught in schools as “computing” or “ICT” (information and communication technology) is generally little more than skills training in the use of software packages. Unsurprisingly, pupils find this tedious, and their natural enthusiasm for the use of computer technology in entertainment and communication is tempered by the impression that the creation of such technology is lacking in intellectual depth. These issues are thought to be at the heart of the 50 percent decline in the number of students studying computer science at university over the last decade. In light of the crucial importance of digital technology to modern society, there has never been a more important time to re-engage our population with the fascination of computer science.
In 2008 I was fortunate in being selected to present the 180th series of Royal Institution Christmas Lectures, which were initiated by Michael Faraday in 1826. The 2008 lectures were the first time they had been given on the theme of computer science. When preparing these lectures I spent much time thinking about how to explain computer science to a general audience, and realized that there are very few resources, and almost no popular books, that address this need. This new book by MacCormick is therefore particularly welcome.
MacCormick has done a superb job of bringing complex ideas from computer science to a general audience. Many of these ideas have an extraordinary beauty and elegance which alone makes them worthy of attention. To give just one example: the explosive growth of web-based commerce is only possible because of the ability to send confidential information (such as credit card numbers) secretly and securely across the Internet. The fact that secure communication can be established over “open” channels was for decades thought to be an intractable problem. When a solution was found, it turned out to be remarkably elegant, and is explained by MacCormick using precise analogies that require no prior knowledge of computer science. Such gems make this book an invaluable contribution to the popular science bookshelf, and I highly commend it.
Chris Bishop
Distinguished Scientist, Microsoft Research Cambridge
Vice President, The Royal Institution of Great Britain
Professor of Computer Science, University of Edinburgh
1
Introduction: What Are the Extraordinary Ideas Computers Use Every Day?
—WILLIAM SHAKESPEARE, Love's Labour's Lost
How were the great ideas of computer science born? Here's a selection:
• In the 1930s, before the first digital computer has even been built, a British genius founds the field of computer science, then goes on to prove that certain problems cannot be solved by any computer to be built in the future, no matter how fast, powerful, or cleverly designed.
• In 1948, a scientist working at a telephone company publishes a paper that founds the field of information theory. His work will allow computers to transmit a message with perfect accuracy even when most of the data is corrupted by interference.
• In 1956, a group of academics attend a conference at Dartmouth with the explicit and audacious goal of founding the field of artificial intelligence. After many spectacular successes and numerous great disappointments, we are still waiting for a truly intelligent computer program to emerge.
• In 1969, a researcher at IBM discovers an elegant new way to structure the information in a database. The technique is now used to store and retrieve the information underlying most online transactions.
• In 1974, researchers in the British government's lab for secret communications discover a way for computers to communicate securely even when another computer can observe everything that passes between them. The researchers are bound by government secrecy—but fortunately, three American professors independently discover and extend this astonishing invention that underlies all secure communication on the internet.
• In 1996, two Ph.D. students at Stanford University decide to collaborate on building a web search engine. A few years later, they have created Google, the first digital giant of the internet era.
As we enjoy the astonishing growth of technology in the 21st century, it has become impossible to use a computing device—whether it be a cluster of the most powerful machines available or the latest, most fashionable handheld device—without relying on the fundamental ideas of computer science, all born in the 20th century. Think about it: have you done anything impressive today? Well, the answer depends on your point of view. Have you, perhaps, searched a corpus of billions of documents, picking out the two or three that are most relevant to your needs? Have you stored or transmitted many millions of pieces of information, without making a single mistake—despite the electromagnetic interference that affects all electronic devices? Did you successfully complete an online transaction, even though many thousands of other customers were simultaneously hammering the same server? Did you communicate some confidential information (for example, your credit card number) securely over wires that can be snooped by dozens of other computers? Did you use the magic of compression to reduce a multimegabyte photo down to a more manageable size for sending in an e-mail? Or did you, without even thinking about it, exploit the artificial intelligence in a hand-held device that self-corrects your typing on its tiny keyboard?
Each of these impressive feats relies on the profound discoveries listed earlier. Thus, most computer users employ these ingenious ideas many times every day, often without even realizing it! It is the objective of this book to explain these concepts—the great ideas of computer science that we use every day—to the widest possible audience. Each concept is explained without assuming any knowledge of computer science.
ALGORITHMS: THE BUILDING BLOCKS OF THE GENIUS AT YOUR FINGERTIPS
So far, I've been talking about great “ideas” of computer science, but computer scientists describe many of their important ideas as “algorithms.” So what's the difference between an idea and an algorithm? What, indeed, is an algorithm? The simplest answer to this question is to say that an algorithm is a precise recipe that specifies the exact sequence of steps required to solve a problem. A great example of this is an algorithm we all learn as children in school: the algorithm for adding two large numbers together. An example is shown above. The algorithm involves a sequence of steps that starts off something like this: “First, add the final digits of the two numbers together, write down the final digit of the result, and carry any other digits into the next column on the left; second, add the digits in the next column together, add on any carried digits from the previous column…”—and so on.
The first two steps in the algorithm for adding two numbers.
Note the almost mechanical feel of the algorithm's steps. This is, in fact, one of the key features of an algorithm: each of the steps must be absolutely precise, requiring no human intuition or guesswork. That way, each of the purely mechanical steps can be programmed into a computer. Another important feature of an algorithm is that it always works, no matter what the inputs. The addition algorithm we learned in school does indeed have this property: no matter what two numbers you try to add together, the algorithm will eventually yield the correct answer. For example, although it would take a rather long time, you could certainly use this algorithm to add two 1000-digit numbers together.
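The schoolroom addition recipe described above can be written as a short program. The following is only a minimal sketch in Python, not code from the book: the function name `add_decimal_strings` is hypothetical, the numbers are assumed to be non-negative and written as strings of decimal digits, and adding two single digits is treated as a built-in primitive.

```python
def add_decimal_strings(a: str, b: str) -> str:
    """Add two non-negative integers written as decimal digit strings.

    Hypothetical helper for illustration: it follows the column-by-column
    method, working from the final digits toward the left.
    """
    result_digits = []          # digits of the answer, rightmost first
    carry = 0                   # digit carried into the current column
    i, j = len(a) - 1, len(b) - 1

    # Keep going while either number has digits left or a carry remains.
    while i >= 0 or j >= 0 or carry:
        digit_a = int(a[i]) if i >= 0 else 0
        digit_b = int(b[j]) if j >= 0 else 0
        column_sum = digit_a + digit_b + carry
        result_digits.append(str(column_sum % 10))  # write down the final digit
        carry = column_sum // 10                    # carry the rest to the left
        i -= 1
        j -= 1

    # The digits were collected right to left, so reverse them.
    return "".join(reversed(result_digits)) or "0"


if __name__ == "__main__":
    print(add_decimal_strings("4844978", "3745945"))
```

Every step is purely mechanical, and the loop handles inputs of any length, so the same function adds two 1000-digit numbers without modification.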
You may be a little curious about this definition of an algorithm as a precise, mechanical recipe. Exactly how precise does the recipe need to be? What fundamental operations are permitted? For example, in the addition algorithm above, is it okay to simply say “add the two digits together,” or do we have to somehow specify the entire set of addition tables for single-digit numbers? These details might seem innocuous or perhaps even pedantic, but it turns out that nothing could be further from the truth: the real answers to these questions lie right at the heart of computer science and also have connections to philosophy, physics, neuroscience, and genetics. The deep questions about what an algorithm really is all boil down to a proposition known as the Church-Turing thesis. We will revisit these issues in chapter 10, which discusses the theoretical limits of computation and some aspects of the Church-Turing thesis. Meanwhile, the informal notion of an algorithm as a very precise recipe will serve us perfectly well.
Now we know what an algorithm is, but what is the connection to computers? The key point is that computers need to be programmed with very precise instructions. Therefore, before we can get a computer to solve a particular problem for us, we need to develop an algorithm for that problem. In other scientific disciplines, such as mathematics and physics, important results are often captured by a single formula. (Famous examples include the Pythagorean theorem, a² + b² = c², or Einstein's E = mc².) In contrast, the great ideas of computer science generally describe how to solve a problem, using an algorithm, of course. So, the main purpose of this book is to explain what makes your computer into your own personal genius: the great algorithms your computer uses every day.
WHAT MAKES A GREAT ALGORITHM?
This brings us to the tricky question of which algorithms are truly “great.” The list of potential candidates is rather large, but I've used a few essential criteria to whittle down that list for this book. The first and most important criterion is that the algorithms are used by ordinary computer users every day. The second important criterion is that the algorithms should address concrete, real-world problems—problems like compressing a particular file or transmitting it accurately over a noisy link. For readers who already know some computer science, the box on the next page explains some of the consequences of these first two criteria.
The third criterion is that the algorithms relate primarily to the theory of computer science. This eliminates techniques that focus on computer hardware, such as CPUs, monitors, and networks. It also reduces emphasis on design of infrastructure such as the internet. Why do I choose to focus on computer science theory? Part of my motivation is the imbalance in the public's perception of computer science: there is a widespread belief that computer science is mostly about programming (i.e., “software”) and the design of gadgets (i.e., “hardware”). In fact, many of the most beautiful ideas in computer science are completely abstract and don't fall in either of these categories. By emphasizing these theoretical ideas, it is my hope that more people will begin to understand the nature of computer science as an intellectual discipline.
You may have noticed that I've been listing criteria to eliminate potential great algorithms, while avoiding the much more difficult issue of defining greatness in the first place. For this, I've relied on my own intuition. At the heart of every algorithm explained in the book is an ingenious trick that makes the whole thing work. The presence of an “aha” moment, when this trick is revealed, is what makes the explanation of these algorithms an exhilarating experience for me and hopefully also for you. Since I'll be using the word “trick” a great deal, I should point out that I'm not talking about the kind of tricks that are mean or deceitful—the kind of trick a child might play on a younger brother or sister. Instead, the tricks in this book resemble tricks of the trade or even magic tricks: clever techniques for accomplishing goals that would otherwise be difficult or impossible.
The first criterion—everyday use by ordinary computer users—eliminates algorithms used primarily by computer professionals, such as compilers and program verification techniques. The second criterion—concrete application to a specific problem—eliminates many of the great algorithms that are central to the undergraduate computer science curriculum. This includes sorting algorithms like quicksort, graph algorithms such as Dijkstra's shortest-path algorithm, and data structures such as hash tables. These algorithms are indisputably great and they easily meet the first criterion, since most application programs run by ordinary users employ them repeatedly. But these algorithms are generic: they can be applied to a vast array of different problems. In this book, I have chosen to focus on algorithms for specific problems, since they have a clearer motivation for ordinary computer users.
Some additional details about the selection of algorithms for this book. Readers of this book are not expected to know any computer science. But if you do have a background in computer science, this box explains why many of your old favorites aren't covered in the book.
Thus, I've used my own intuition to pick out what I believe are the most ingenious, magical tricks out there in the world of computer science. The British mathematician G. H. Hardy famously put it this way in his book A Mathematician's Apology, in which he tried to explain to the public why mathematicians do what they do: “Beauty is the first test: there is no permanent place in the world for ugly mathematics.” This same test of beauty applies to the theoretical ideas underlying computer science. So the final criterion for the algorithms presented in this book is what we might call Hardy's beauty test: I hope I have succeeded in conveying to the reader at least some portion of the beauty that I personally feel is present in each of the algorithms.
Let's move on to the specific algorithms I chose to present. The profound impact of search engines is perhaps the most obvious example of an algorithmic technology that affects all computer users, so it's not surprising that I included some of the core algorithms of web search. Chapter 2 describes how search engines use indexing to find documents that match a query, and chapter 3 explains PageRank, the original version of the algorithm used by Google to ensure that the most relevant matching documents are at the top of the results list.
Even if we don't stop to think about it very often, most of us are at least aware that search engines are using some deep computer science ideas to provide their incredibly powerful results. In contrast, some of the other great algorithms are frequently invoked without the computer user even realizing it. Public key cryptography, described in chapter 4, is one such algorithm. Every time you visit a secure website (with https instead of http at the start of its address), you use the aspect of public key cryptography known as key exchange to set up a secure session. Chapter 4 explains how this key exchange is achieved.
The topic of chapter 5, error-correcting codes, is another class of algorithms that we use constantly without realizing it. In fact, error-correcting codes are probably the single most frequently used great idea of all time. They allow a computer to recognize and correct errors in stored or transmitted data, without having to resort to a backup copy or a retransmission. These codes are everywhere: they are used in all hard disk drives, many network transmissions, on CDs and DVDs, and even in some computer memories. But they do their job so well that we are never even aware of them.
Chapter 6 is a little exceptional. It covers pattern recognition algorithms, which sneak into the list of great computer science ideas despite violating the very first criterion: that ordinary computer users must use them every day. Pattern recognition is the class of techniques whereby computers recognize highly variable information, such as handwriting, speech, and faces. In fact, in the first decade of the 21st century, most everyday computing did not use these techniques. But as I write these words in 2011, the importance of pattern recognition is increasing rapidly: mobile devices with small on-screen keyboards need automatic correction, tablet devices must recognize handwritten input, and all these devices (especially smartphones) are becoming increasingly voice-activated. Some websites even use pattern recognition to determine what kind of advertisements to display to their users. In addition, I have a personal bias toward pattern recognition, which is my own area of research. So chapter 6 describes three of the most interesting and successful pattern recognition techniques: nearest-neighbor classifiers, decision trees, and neural networks.
Compression algorithms, discussed in chapter 7, form another set of great ideas that help transform a computer into a genius at our fingertips. Computer users do sometimes apply compression directly, perhaps to save space on a disk or to reduce the size of a photo before e-mailing it. But compression is used even more often under the covers: without us being aware of it, our downloads or uploads may be compressed to save bandwidth, and data centers often compress customers' data to reduce costs. That 5 GB of space that your e-mail provider allows you probably occupies significantly less than 5 GB of the provider's storage!
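To make the space-saving claim concrete, here is a small illustrative sketch using `zlib`, a lossless compressor in Python's standard library. It is not taken from the book; the helper name `compression_round_trip` and the sample text are assumptions chosen for the demonstration, and real savings depend heavily on how repetitive the data is.

```python
import zlib


def compression_round_trip(data: bytes) -> tuple:
    """Compress data, decompress it again, and report what happened.

    Hypothetical helper for illustration. Returns a tuple of
    (original size, compressed size, whether the data survived intact).
    """
    compressed = zlib.compress(data)
    restored = zlib.decompress(compressed)
    return len(data), len(compressed), restored == data


if __name__ == "__main__":
    # Highly repetitive sample text, which compresses very well.
    sample = ("Dear customer, thank you for your order. " * 500).encode("utf-8")
    original_size, compressed_size, intact = compression_round_trip(sample)
    print("original bytes:  ", original_size)
    print("compressed bytes:", compressed_size)
    print("restored exactly:", intact)
```

Because the compression is lossless, the restored bytes match the original exactly, while the stored form takes up less room.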
Chapter 8 covers some of the fundamental algorithms underlying databases. The chapter emphasizes the clever techniques employed to achieve consistency—meaning that the relationships in a database never contradict each other. Without these ingenious techniques, most of our online life (including online shopping and interacting with social networks like Facebook) would collapse in a jumble of computer errors. This chapter explains what the problem of consistency really is and how computer scientists solve it without sacrificing the formidable efficiency we expect from online systems.
In chapter 9, we learn about one of the indisputable gems of theoretical computer science: digital signatures. The ability to “sign” an electronic document digitally seems impossible at first glance. Surely, you might think, any such signature must consist of digital information, which can be copied effortlessly by anyone wishing to forge the signature. The resolution of this paradox is one of the most remarkable achievements of computer science.
We take a completely different tack in chapter 10: instead of describing a great algorithm that already exists, we will learn about an algorithm that would be great if it existed. Astonishingly, we will discover that this particular great algorithm is impossible. This establishes some absolute limits on the power of computers to solve problems, and we will briefly discuss the implications of this result for philosophy and biology.
In the conclusion, we will draw together some common threads from the great algorithms and spend a little time speculating about what the future might hold. Are there more great algorithms out there or have we already found them all?
This is a good time to mention a caveat about the book's style. It's essential for any scientific writing to acknowledge sources clearly, but citations break up the flow of the text and give it an academic flavor. As readability and accessibility are top priorities for this book, there are no citations in the main body of the text. All sources are, however, clearly identified—often with amplifying comments—in the “Sources and Further Reading” section at the end of the book. This section also points to additional material that interested readers can use to find out more about the great algorithms of computer science.
While I'm dealing with caveats, I should also mention that a small amount of poetic license was taken with the book's title. Our Nine Algorithms That Changed the Future are—without a doubt—revolutionary, but are there exactly nine of them? This is debatable, and depends on exactly what gets counted as a separate algorithm. So let's see where the “nine” comes from. Excluding the introduction and conclusion, there are nine chapters in the book, each covering algorithms that have revolutionized a different type of computational task, such as cryptography, compression, or pattern recognition. Thus, the “Nine Algorithms” of the book's title really refer to nine classes of algorithms for tackling these nine computational tasks.
WHY SHOULD WE CARE ABOUT THE GREAT ALGORITHMS?
Hopefully, this quick summary of the fascinating ideas to come has left you eager to dive in and find out how they really work. But you may still be wondering: what is the ultimate goal here? So let me make some brief remarks about the true purpose of this book. It is definitely not a how-to manual. After reading the book, you won't be an expert on computer security or artificial intelligence or anything else. It's true that you may pick up some useful skills. For example: you'll be more aware of how to check the credentials of “secure” websites and “signed” software packages; you'll be able to choose judiciously between lossy and lossless compression for different tasks; and you may be able to use search engines more efficiently by understanding some aspects of their indexing and ranking techniques.
These, however, are relatively minor bonuses compared to the book's true objective. After reading the book, you won't be a vastly more skilled computer user. But you will have a much deeper appreciation of the beauty of the ideas you are constantly using, day in and day out, on all your computing devices.
Why is this a good thing? Let me argue by analogy. I am definitely not an expert on astronomy—in fact, I'm rather ignorant on the topic and wish I knew more. But every time I glance at the night sky, the small amount of astronomy that I do know enhances my enjoyment of this experience. Somehow, my understanding of what I am looking at leads to a feeling of contentment and wonder. It is my fervent hope that after reading this book, you will occasionally achieve this same sense of contentment and wonder while using a computer. You'll have a true appreciation of the most ubiquitous, inscrutable black box of our times: your personal computer, the genius at your fingertips.
2
Search Engine Indexing: Finding Needles in the World's Biggest Haystack
—MARK TWAIN, Tom Sawyer
Search engines have a profound effect on our lives. Most of us issue search queries many times a day, yet we rarely stop to wonder just how this remarkable tool can possibly work. The vast amount of information available and the speed and quality of the results have come to seem so normal that we actually get frustrated if a question can't be answered within a few seconds. We tend to forget that every successful web search extracts a needle from the world's largest haystack: the World Wide Web.
In fact, the superb service provided by search engines is not just the result of throwing a large amount of fancy technology at the problem. Yes, each of the major search engine companies runs an international network of enormous data centers, containing thousands of server computers and advanced networking equipment. But all of this hardware would be useless without the clever algorithms needed to organize and retrieve the information we request. So in this chapter and the one that follows, we'll investigate some of the algorithmic gems that are put to work for us every time we do a web search. As we'll soon see, two of the main tasks for a search engine are matching and ranking. This chapter covers a clever matching technique: the metaword trick. In the next chapter, we turn to the ranking task and examine Google's celebrated PageRank algorithm.
MATCHING AND RANKING
It will be helpful to begin with a high-level view of what happens when you issue a web search query. As already mentioned, there will be two main phases: matching and ranking. In practice, search engines combine matching and ranking into a single process for efficiency. But the two phases are conceptually separate, so we'll assume that matching is completed before ranking begins. The figure above shows an example, where the query is “London bus timetable.” The matching phase answers the question “which web pages match my query?”—in this case, all pages that mention London bus timetables.
The two phases of web search: matching and ranking. There can be thousands or millions of matches after the first (matching) phase, and these must be sorted by relevance in the second (ranking) stage.
But many queries on real search engines have hundreds, thousands, or even millions of hits. And the users of search engines generally prefer to look through only a handful of results, perhaps five or ten at the most. Therefore, a search engine must be capable of picking the best few from a very large number of hits. A good search engine will not only pick out the best few hits, but display them in the most useful order—with the most suitable page listed first, then the next most suitable, and so on.
The task of picking out the best few hits in the right order is called “ranking.” This is the crucial second phase that follows the initial matching phase. In the cutthroat world of the search industry, search engines live or die by the quality of their ranking systems. Back in 2002, the market share of the top three search engines in the United States was approximately equal, with Google, Yahoo, and MSN each having just under 30% of U.S. searches. (MSN was later rebranded first as Live Search and then as Bing.) In the next few years, Google made a dramatic improvement in its market share, crushing Yahoo and MSN down to under 20% each. It is widely believed that the phenomenal rise of Google to the top of the search industry was due to its ranking algorithms. So it's no exaggeration to say that search engines live or die according to the quality of their ranking algorithms. But as already mentioned, we'll be discussing ranking algorithms in the next chapter. For now, let's focus on the matching phase.
ALTAVISTA: THE FIRST WEB-SCALE MATCHING ALGORITHM
Where does our story of search engine matching algorithms begin? An obvious—but wrong—answer would be to start with Google, the greatest technology success story of the early 21st century. Indeed, the story of Google's beginnings as the Ph.D. project of two graduate students at Stanford University is both heartwarming and impressive. It was in 1998 that Larry Page and Sergey Brin assembled a ragtag bunch of computer hardware into a new type of search engine. Less than 10 years later, their company had become the greatest digital giant to rise in the internet age.
But the idea of web search had already been around for several years. Among the earliest commercial offerings were Infoseek and Lycos (both launched in 1994), and AltaVista, which launched its search engine in 1995. For a few years in the mid-1990s, AltaVista was the king of the search engines. I was a graduate student in computer science during this period, and I have clear memories of being wowed by the comprehensiveness of AltaVista's results. For the first time, a search engine had fully indexed all of the text on every page of the web—and, even better, results were returned in the blink of an eye. Our journey toward understanding this sensational technological breakthrough begins with a (literally) age-old concept: indexing.
PLAIN OLD INDEXING
The concept of an index is the most fundamental idea behind any search engine. But search engines did not invent indexes: in fact, the idea of indexing is almost as old as writing itself. For example, archaeologists have discovered a 5000-year-old Babylonian temple library that cataloged its cuneiform tablets by subject. So indexing has a pretty good claim to being the oldest useful idea in computer science.
These days, the word “index” usually refers to a section at the end of a reference book. All of the concepts you might want to look up are listed in a fixed order (usually alphabetical), and under each concept is a list of locations (usually page numbers) where that concept is referenced. So a book on animals might have an index entry that looks like “cheetah 124, 156,” which means that the word “cheetah” appears on pages 124 and 156. (As a mildly amusing exercise, you could look up the word “index” in the index of this book. You should be brought back to this very page.)
The index for a web search engine works the same way as a book's index. The “pages” of the book are now web pages on the World Wide Web, and search engines assign a different page number to every single web page on the web. (Yes, there are a lot of pages, many billions at the last count, but computers are great at dealing with large numbers.) The figure above gives an example that will make this more concrete. Imagine that the World Wide Web consisted of only the 3 short web pages shown there, where the pages have been assigned page numbers 1, 2, and 3.
A simple index with page numbers.
A computer could build up an index of these three web pages by first making a list of all the words that appear in any page and then sorting that list in alphabetical order. Let's call the result a word list: in this particular case it would be “a, cat, dog, mat, on, sat, stood, the, while.” Then the computer would run through the pages word by word. For each word, it would make a note of the current page number next to the corresponding word in the word list. The final result is shown in the figure above. You can see immediately, for example, that the word “cat” occurs in pages 1 and 3, but not in page 2. And the word “while” appears only in page 3.
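For readers who like to see such a procedure spelled out, here is a minimal Python sketch of the index-building steps just described. The three page texts are an assumption: they are reconstructed from the index entries quoted in this chapter, and the function name `build_index` is purely illustrative.

```python
# Assumed page texts, reconstructed from the index entries quoted in the chapter.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}


def build_index(pages):
    """Map each word to the sorted list of page numbers it appears on."""
    index = {}
    for page_number in sorted(pages):
        for word in pages[page_number].split():
            page_list = index.setdefault(word, [])
            # Record each page at most once per word.
            if not page_list or page_list[-1] != page_number:
                page_list.append(page_number)
    return index


index = build_index(PAGES)
word_list = sorted(index)  # the alphabetical word list

for word in word_list:
    print(word, index[word])
```

Processing the pages in increasing page-number order keeps every page list sorted, which is convenient for the multiword queries discussed later.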
With this very simple approach, a search engine can already provide the answers to a lot of simple queries. For example, suppose you enter the query cat. The search engine can quickly jump to the entry for cat in the word list. (Because the word list is in alphabetical order, a computer can quickly find any entry, just like a human can quickly find a word in a dictionary.) And once it finds the entry for cat, the search engine can just give you the list of pages at that entry—in this case, 1 and 3. Modern search engines format the results nicely, with little snippets from each of the pages that were returned, but we will mostly ignore details like that and concentrate on how search engines know which page numbers are “hits” for the query you entered.
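The dictionary-style lookup mentioned above can be sketched with a binary search over the alphabetical word list. This is only an illustration under stated assumptions: the page lists for “mat,” “on,” and “stood” come from the reconstructed example pages rather than from the text, and `lookup` is a hypothetical helper name.

```python
from bisect import bisect_left

# Alphabetical word list and, in the same order, the pages each word occurs on.
WORD_LIST = ["a", "cat", "dog", "mat", "on", "sat", "stood", "the", "while"]
PAGE_LISTS = [[3], [1, 3], [2, 3], [1, 2], [1, 2], [1, 3], [2, 3], [1, 2, 3], [3]]


def lookup(word):
    """Find a word by binary search, as a person flips through a dictionary."""
    i = bisect_left(WORD_LIST, word)
    if i < len(WORD_LIST) and WORD_LIST[i] == word:
        return PAGE_LISTS[i]
    return []  # the word does not occur on any page


print(lookup("cat"))
print(lookup("zebra"))
```

Binary search needs only about log2(n) comparisons for a list of n words, which is why alphabetical order makes lookups fast even for very long word lists.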
As another very simple example, let's check the procedure for the query dog. In this case, the search engine quickly finds the entry for dog and returns the hits 2 and 3. But how about a multiple-word query, like cat dog? This means you are looking for pages that contain both of the words “cat” and “dog.” Again, this is pretty easy for the search engine to do with the existing index. It first looks up the two words individually to find which pages they occur on as individual words. This gives the answer 1, 3 for “cat” and 2, 3 for “dog.” Then, the computer can quickly scan along both of the lists of hits, looking for any page numbers that occur on both lists. In this case, pages 1 and 2 are rejected, but page 3 occurs in both lists, so the final answer is a single hit on page 3. And a very similar strategy works for queries with more than two words. For example, the query cat the sat returns pages 1 and 3 as hits, since they are the common elements of the lists for “cat” (1, 3), “the” (1, 2, 3), and “sat” (1, 3).
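The “scan along both lists” step can be sketched as a merge-style intersection of two sorted page lists. The function names below are illustrative assumptions, and the tiny index contains only the entries quoted in the text.

```python
def intersect(list_a, list_b):
    """Return the page numbers present in both sorted lists, scanning each once."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # this page cannot be in list_b; move on
        else:
            j += 1  # this page cannot be in list_a; move on
    return result


def multiword_query(index, words):
    """Pages containing every query word, in any order."""
    if not words:
        return []
    hits = index.get(words[0], [])
    for word in words[1:]:
        hits = intersect(hits, index.get(word, []))
    return hits


INDEX = {"cat": [1, 3], "dog": [2, 3], "the": [1, 2, 3], "sat": [1, 3]}

print(multiword_query(INDEX, ["cat", "dog"]))
print(multiword_query(INDEX, ["cat", "the", "sat"]))
```

Because both lists are sorted, each is read only once from start to finish, so the work grows with the length of the lists rather than with the size of the web.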
So far, it sounds like building a search engine would be pretty easy. The simplest possible indexing technology seems to work just fine, even for multiword queries. Unfortunately, it turns out that this simple approach is completely inadequate for modern search engines. There are quite a few reasons for this, but for now we will concentrate on just one of the problems. This is the problem of how to do phrase queries. A phrase query is a query that searches for an exact phrase, rather than just the occurrence of some words anywhere on a page. On most search engines, phrase queries are entered using quotation marks. So, for example, the query “cat sat” has a very different meaning to the query cat sat. The query cat sat looks for pages that contain the two words “cat” and “sat” anywhere, in any order; whereas the query “cat sat” looks for pages that contain the word “cat” immediately followed by the word “sat.” In our simple three-page example, cat sat results in hits on pages 1 and 3, but “cat sat” returns only one hit, on page 1.
How can a search engine efficiently perform a phrase query? Let's stick with the “cat sat” example. It seems like the first step should be to do the same thing as for the ordinary multiword query cat sat: retrieve from the word list the list of pages that each word occurs on, in this case 1, 3 for “cat,” and the same thing—1, 3—for “sat.” But here the search engine is stuck. It knows for sure that both words occur on both pages 1 and 3, but there is no way of telling whether the words occur next to each other in the right order. You might think that at this point the search engine could go back and look at the original web pages to see if the exact phrase is there or not. This would indeed be a possible solution, but it is very, very inefficient. It requires reading through the entire contents of every web page that might contain the phrase, and there could be a huge number of such pages. Remember, we are dealing with an extremely small example of only three pages here, but a real search engine has to give correct results on tens of billions of web pages.
THE WORD-LOCATION TRICK
The solution to this problem is the first really ingenious idea that makes modern search engines work well: the index should not store only page numbers, but also locations within the pages. These locations are nothing mysterious: they just indicate the position of a word within its page. So the third word has location 3, the 29th word has location 29, and so on. Our entire three-page data set is shown in the top figure on the next page, with the word locations added. Below that is the index that results from storing both page numbers and word locations. We'll call this way of building an index the “word-location trick.” Let's look at a couple of examples to make sure we understand the word-location trick. The first line of the index is “a 3-5.” This means the word “a” occurs exactly once in the data set, on page 3, and it is the fifth word on that page. The longest line of the index is “the 1-1 1-5 2-1 2-5 3-1.” This line lets you know the exact locations of all occurrences of the word “the” in the data set. It occurs twice on page 1 (at locations 1 and 5), twice on page 2 (at locations 1 and 5), and once on page 3 (at location 1).
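As a rough sketch, an index with word locations could be built as follows. The page texts are again an assumption reconstructed from the quoted index lines, and the helper names are hypothetical.

```python
# Assumed page texts, reconstructed from the index entries quoted in the chapter.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}


def build_positional_index(pages):
    """Map each word to a list of (page number, location) pairs."""
    index = {}
    for page_number in sorted(pages):
        words = pages[page_number].split()
        # Locations start at 1, so the first word on a page has location 1.
        for location, word in enumerate(words, start=1):
            index.setdefault(word, []).append((page_number, location))
    return index


def format_entry(word, entries):
    """Render an entry in the chapter's notation, e.g. 'a 3-5'."""
    return word + " " + " ".join(f"{page}-{loc}" for page, loc in entries)


index = build_positional_index(PAGES)
for word in sorted(index):
    print(format_entry(word, index[word]))
```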
Now, remember why we introduced these in-page word locations: it was to solve the problem of how to do phrase queries efficiently. So let's see how to do a phrase query with this new index. We'll work with the same query as before, “cat sat”. The first steps are the same as with the old index: extract the locations of the individual words from the index, so for “cat” we get 1-2, 3-2, and for “sat” we get 1-3, 3-7. So far, so good: we know that the only possible hits for the phrase query “cat sat” can be on pages 1 and 3. But just like before, we are not yet sure whether that exact phrase occurs on those pages—it is possible that the two words do appear, but not next to each other in the correct order. Luckily, it is easy to check this from the location information. Let's concentrate on page 1 initially. From the index information, we know that “cat” appears at position 2 on page 1 (that's what the 1-2 means). And we know that “sat” appears at position 3 on page 1 (that's what the 1-3 means). But if “cat” is at position 2, and “sat” is at position 3, then we know “sat” appears immediately after “cat” (because 3 comes immediately after 2)—and so the entire phrase we are looking for, “cat sat,” must appear on this page beginning at position 2!
Top: Our three web pages with in-page word locations added. Bottom: A new index that includes both page numbers and in-page word locations.
I know I am laboring this point, but the reason for going through this example in excruciating detail is to understand exactly what information is used to arrive at this answer. Note that we have found a hit for the phrase “cat sat” by looking only at the index information (1-2, 3-2 for “cat,” and 1-3, 3-7 for “sat”), not at the original web pages themselves. This is crucial, because we only had to look at the two entries in the index, rather than reading through all of the pages that might be hits—and there could be literally millions of such pages in a real search engine performing a real phrase query. To summarize: including the in-page word locations in the index has allowed us to find a phrase query hit by looking at only a couple of lines in the index, rather than reading through a large number of web pages. This simple word-location trick is one of the keys to making search engines work!
Actually, we haven't even finished working through the “cat sat” example. We finished processing the information for page 1, but not for page 3. But the reasoning for page 3 is similar: we see that “cat” appears at location 2, and “sat” occurs at location 7, so they cannot possibly occur next to each other, because 7 is not immediately after 2. So we know that page 3 is not a hit for the phrase query “cat sat”, even though it is a hit for the multiword query cat sat.
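The whole “cat sat” check can be sketched in a few lines, using only the index entries and never the pages themselves. This is a simplified two-word version; `phrase_query` is an illustrative name, and looking locations up in a set is a convenience of the sketch rather than something the text prescribes.

```python
# Index entries as (page, location) pairs, taken from the text.
INDEX = {
    "cat": [(1, 2), (3, 2)],
    "sat": [(1, 3), (3, 7)],
    "dog": [(2, 2), (3, 6)],
}


def phrase_query(index, first, second):
    """Pages on which `second` appears immediately after `first`."""
    second_locations = set(index.get(second, []))
    hits = []
    for page, location in index.get(first, []):
        # The phrase occurs if `second` sits at the very next location.
        if (page, location + 1) in second_locations and page not in hits:
            hits.append(page)
    return hits


print(phrase_query(INDEX, "cat", "sat"))
```

For page 1 the sketch finds “sat” at location 3, right after “cat” at location 2. For page 3 it looks for “sat” at location 3, finds it only at location 7, and rejects the page.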
By the way, the word-location trick is important for more than just phrase queries. As one example, consider the problem of finding words that are near to each other. On some search engines, you can do this with the NEAR keyword in the query. In fact, the AltaVista search engine offered this facility from its early days and still does at the time of writing. As a specific example, suppose that on some particular search engine, the query cat NEAR dog finds pages in which the word “cat” occurs within five words of the word “dog.” How can we perform this query efficiently on our data set? Using word locations, it's easy. The index entry for “cat” is 1-2, 3-2, and the index entry for “dog” is 2-2, 3-6. So we see immediately that page 3 is the only possible hit. And on page 3, “cat” appears at location 2, and “dog” appears at location 6. So the distance between the two words is 6 – 2, which is 4. Therefore, “cat” does appear within five words of “dog,” and page 3 is a hit for the query cat NEAR dog. Again, note how efficiently we could perform this query: there was no need to read through the actual content of any web pages—instead, only two entries from the index were consulted.
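A NEAR query along these lines might be sketched as follows. The five-word threshold comes from the example above; the function name and the simple pairwise comparison are assumptions made for illustration.

```python
# Index entries as (page, location) pairs, taken from the text.
INDEX = {
    "cat": [(1, 2), (3, 2)],
    "dog": [(2, 2), (3, 6)],
}


def near_query(index, first, second, max_distance=5):
    """Pages on which the two words occur within max_distance words of each other."""
    hits = []
    for page_a, loc_a in index.get(first, []):
        for page_b, loc_b in index.get(second, []):
            same_page = page_a == page_b
            close_enough = abs(loc_a - loc_b) <= max_distance
            if same_page and close_enough and page_a not in hits:
                hits.append(page_a)
    return hits


print(near_query(INDEX, "cat", "dog"))
```

Comparing every pair of entries is fine for a toy index. A larger system would more plausibly walk the two sorted entry lists in step, as in the multiword query.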
It turns out that NEAR queries aren't very important to search engine users in practice. Almost no one uses NEAR queries, and most major search engines don't even support them. But despite this, the ability to perform NEAR queries is actually crucial to real-life search engines. This is because the search engines themselves are constantly performing NEAR queries behind the scenes. To understand why, we first have to take a look at one of the other major problems that confronts modern search engines: the problem of ranking.
RANKING AND NEARNESS
So far, we've been concentrating on the matching phase: the problem of efficiently finding all of the hits for a given query. But as emphasized earlier, the second phase, “ranking,” is absolutely essential for a high-quality search engine: this is the phase that picks out the top few hits for display to the user.
Let's examine the concept of ranking a little more carefully. What does the “rank” of a page really depend on? The real question is not “Does this page match the query?” but rather “Is this page relevant to the query?” Computer scientists use the term “relevance” to describe how suitable or useful a given page is, in response to a particular query.
As a concrete example, suppose you are interested in what causes malaria, and you enter the query malaria cause into a search engine. To keep things simple, imagine there are only two hits for that query in the search engine—the two pages shown in the figure on the following page. Have a look at those pages now. It should be immediately clear to you, as a human, that page 1 is indeed about the causes of malaria, whereas page 2 seems to be the description of some military campaign which just happens, by coincidence, to use the words “cause” and “malaria.” So page 1 is undoubtedly more “relevant” to the query malaria cause than page 2. But computers are not humans, and there is no easy way for a computer to understand the topics of these two pages, so it might seem impossible for a search engine to rank these two hits correctly.
Top: Two example web pages that mention malaria.
Bottom: Part of the index built from the above two web pages.
However, there is, in fact, a very simple way to get the ranking right in this case. It turns out that pages where the query words occur near each other are more likely to be relevant than pages where the query words are far apart. In the malaria example, we see that the words “malaria” and “cause” occur within two words of each other in page 1, but are separated by 17 words in page 2. (And remember, the search engine can find this out efficiently by looking at just the index entries, without having to go back and look at the web pages themselves.) So although the computer doesn't really “understand” the topic of this query, it can guess that page 1 is more relevant than page 2, because the query words occur much closer on page 1 than on page 2.
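To make the nearness idea concrete, here is a small sketch that orders matching pages by the smallest gap between the two query words. The word locations below are illustrative assumptions chosen to reproduce the distances quoted in the text (2 on page 1, 17 on page 2); real search engines combine nearness with many other signals, so this is not a complete ranking method.

```python
# Hypothetical (page, location) entries, chosen to match the quoted distances.
INDEX = {
    "cause": [(1, 6), (2, 2)],
    "malaria": [(1, 8), (2, 19)],
}


def min_distance(index, first, second, page):
    """Smallest gap between the two words on one page (page must contain both)."""
    first_locs = [loc for p, loc in index.get(first, []) if p == page]
    second_locs = [loc for p, loc in index.get(second, []) if p == page]
    return min(abs(a - b) for a in first_locs for b in second_locs)


def rank_by_nearness(index, first, second, matching_pages):
    """Order matching pages so that those with the closest query words come first."""
    return sorted(
        matching_pages,
        key=lambda page: min_distance(index, first, second, page),
    )


print(rank_by_nearness(INDEX, "malaria", "cause", [2, 1]))
```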
To summarize: although humans don't use NEAR queries much, search engines constantly use information about nearness to improve their rankings, and they can do this efficiently because they use the word-location trick.
An example set of web pages that each have a title and a body.
We already know that the Babylonians were using indexing 5000 years before search engines existed. It turns out that search engines did not invent the word-location trick either: this is a well-known technique that was used in other types of information retrieval before the internet arrived on the scene. However, in the next section we will learn about a new trick that does appear to have been invented by search engine designers: the metaword trick. The cunning use of this trick and various related ideas helped to catapult the AltaVista search engine to the top of the search industry in the late 1990s.
THE METAWORD TRICK
So far, we've been using extremely simple examples of web pages. As you probably know, most web pages have quite a lot of structure, including titles, headings, links, and images, whereas we have been treating web pages as just ordinary lists of words. We're now going to find out how search engines take account of the structure in web pages. But to keep things as simple as possible, we will introduce only one aspect of structuring: we will allow our pages to have a title at the top of the page, followed by the body of the page. The figure above shows our familiar three-page example with some titles added.
Actually, to analyze web page structure in the same way that search engines do, we need to know a little more about how web pages are written. Web pages are composed in a special language that allows web browsers to display them in a nicely formatted way. (The most common language for this purpose is called HTML, but the details of HTML are not important for this discussion.) The formatting instructions for headings, titles, links, images, and the like are written using special words called metawords. As an example, the metaword used to start the title of a web page might be <titleStart>, and the metaword for ending the title might be <titleEnd>. Similarly, the body of the web page could be started with <bodyStart> and ended with <bodyEnd>. Don't let the symbols “<” and “>” confuse you. They appear on most computer keyboards and are often known by their mathematical meanings as “less than” and “greater than.” But here, they have nothing whatsoever to do with math—they are just being used as convenient symbols to mark the metawords as different from regular words on a web page.
The same set of web pages as in the last figure, but shown as they might be written with metawords, rather than as they would be displayed in a web browser.
Take a look at the figure above, which displays exactly the same content as the previous figure, but now showing how the web pages were actually written, rather than how they would be displayed in a web browser. Most web browsers allow you to examine the raw content of a web page by choosing a menu option called “view source”—I recommend experimenting with this the next time you get a chance. (Note that the metawords used here, such as <titleStart> and <titleEnd>, are fictitious, easily recognizable examples to aid our understanding. In real HTML, metawords are called tags. The tags for starting and ending titles in HTML are <title> and </title>—search for these tags after using the “view source” menu option.)
When building an index, it is a simple matter to include all of the metawords. No new tricks are needed: you just store the locations of the metawords in the same way as regular words. The figure on the next page shows the index built from the three web pages with metawords. Take a look at this figure and make sure you understand there is nothing mysterious going on here. For example, the entry for “mat” is 1-11, 2-11, which means that “mat” is the 11th word on page 1 and also the 11th word on page 2. The metawords work the same way, so the entry for “<titleEnd>,” which is 1-4, 2-4, 3-4, means that “<titleEnd>” is the fourth word in page 1, page 2, and page 3.
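The following sketch shows that indexing metawords really does require nothing new. The page sources are an assumption: the titles and body text are reconstructed so that they agree with the index entries quoted above, and the fictitious metawords are treated as ordinary whitespace-separated words.

```python
# Assumed page sources, reconstructed to agree with the quoted index entries.
PAGES = {
    1: "<titleStart> my cat <titleEnd> <bodyStart> the cat sat on the mat <bodyEnd>",
    2: "<titleStart> my dog <titleEnd> <bodyStart> the dog stood on the mat <bodyEnd>",
    3: "<titleStart> my pets <titleEnd> <bodyStart> the cat stood while a dog sat <bodyEnd>",
}


def build_positional_index(pages):
    """Index regular words and metawords alike as (page, location) pairs."""
    index = {}
    for page_number in sorted(pages):
        words = pages[page_number].split()
        for location, word in enumerate(words, start=1):
            index.setdefault(word, []).append((page_number, location))
    return index


index = build_positional_index(PAGES)
print(index["mat"])
print(index["<titleEnd>"])
```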
We'll call this simple trick, of indexing metawords in the same way as normal words, the “metaword trick.” It might seem ridiculously simple, but this metaword trick plays a crucial role in allowing search engines to perform accurate searches and high-quality rankings. Let's look at a simple example of this. Suppose for a moment that a search engine supports a special type of query using the IN keyword, so that a query like boat IN TITLE returns hits only for pages that have the word “boat” in the title of the web page, and giraffe IN BODY would find pages whose body contains “giraffe.” Note that most real search engines do not provide IN queries in exactly this way, but some of them let you achieve the same effect by clicking on an “advanced search” option where you can specify that your query words must be in the title, or some other specific part of a document. We are pretending that the IN keyword exists purely to make our explanations easier. In fact, at the time of writing, Google lets you do a title search using the keyword intitle:, so the Google query intitle:boat finds pages with “boat” in the title. Try it for yourself!
The index for the web pages shown in the previous figure, including metawords.
How a search engine performs the search dog IN TITLE.
Let's see how a search engine could efficiently perform the query dog IN TITLE on the three-page example shown in the last two figures. First, it extracts the index entry for “dog,” which is 2-3, 2-7, 3-11. Then (and this might be a little unexpected, but bear with me for a second) it extracts the index entries for both <titleStart> and <titleEnd>. That results in 1-1, 2-1, 3-1 for <titleStart> and 1-4, 2-4, 3-4 for <titleEnd>. The information extracted so far is summarized in the figure above—you can ignore the circles and boxes for now.
The search engine then starts scanning the index entry for “dog,” examining each of its hits and checking whether or not it occurs inside a title. The first hit for “dog” is the circled entry 2-3, corresponding to the third word of page number 2. By scanning along the entries for <titleStart>, the search engine can find out where the title for page 2 begins—that should be the first number that starts with “2-.” In this case it arrives at the circled entry 2-1, which means that the title for page 2 begins at word number 1. In the same way, the search engine can find out where the title for page 2 ends. It just scans along the entries for <titleEnd>, looking for a number that starts with “2-,” and therefore stops at the circled entry 2-4. So page 2's title ends at word 4.
Everything we know so far is summarized by the circled entries in the figure, which tell us the title for page 2 starts at word 1 and ends at word 4, and the word “dog” occurs at word 3. The final step is easy: because 3 is greater than 1 and less than 4, we are certain that this hit for the word “dog” does indeed occur in a title, and therefore page 2 should be a hit for the query dog IN TITLE.
The search engine can now move to the next hit for “dog.” This happens to be 2-7 (the seventh word of page 2), but because we already know that page 2 is a hit, we can skip over this entry and move on to the next one, 3-11, which is marked by a box. This tells us that “dog” occurs at word 11 on page 3. So we start scanning past the current circled locations in the rows for <titleStart> and <titleEnd>, looking for entries that start with “3-.” (It's important to note that we do not have to go back to the start of each row—we can pick up wherever we left off scanning from the previous hit.) In this simple example, the entry starting with “3-” happens to be the very next number in both cases—3-1 for <titleStart> and 3-4 for <titleEnd>. These are both marked by boxes for easy reference. Once again, we have the task of determining whether the current hit for “dog” at 3-11 is located inside a title or not. Well, the information in boxes tells us that on page 3, “dog” is at word 11, whereas the title begins at word 1 and ends at word 4. Because 11 is greater than 4, we know that this occurrence of “dog” occurs after the end of the title and is therefore not in the title—so page 3 is not a hit for the query dog IN TITLE.
So, the metaword trick allows a search engine to answer queries about the structure of a document in an extremely efficient way. The example above was only for searching inside page titles, but very similar techniques allow you to search for words in hyperlinks, image descriptions, and various other useful parts of web pages. And all of these queries can be answered as efficiently as the example above. Just like the queries we discussed earlier, the search engine does not need to go back and look at the original web pages: it can answer the query by consulting just a small number of index entries. And, just as importantly, it only needs to scan through each index entry once. Remember what happened when we had finished processing the first hit on page 2 and moved to the possible hit on page 3: instead of going back to the start of the entries for <titleStart> and <titleEnd>, the search engine could continue scanning from where it had left off. This is a crucial element in making the IN query efficient.
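The walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `in_title` is made up, each index entry is a list of (page, position) pairs sorted by page and then position, and each page has at most one title. The two cursors `s` and `e` only ever move forward, which mirrors the observation that each index entry is scanned once.

```python
def in_title(word_hits, title_starts, title_ends):
    """Return the pages on which the word occurs between the title metawords."""
    matches = []
    s = e = 0  # cursors into title_starts and title_ends; they never move back
    for page, position in word_hits:
        if matches and matches[-1] == page:
            continue  # this page is already known to be a hit
        # Resume scanning from wherever the previous hit left off.
        while s < len(title_starts) and title_starts[s][0] < page:
            s += 1
        while e < len(title_ends) and title_ends[e][0] < page:
            e += 1
        if s == len(title_starts) or e == len(title_ends):
            break  # no title entries remain for this page or any later one
        if title_starts[s][0] != page or title_ends[e][0] != page:
            continue  # this page has no title entries at all
        if title_starts[s][1] < position < title_ends[e][1]:
            matches.append(page)
    return matches


# Index entries quoted in the text.
dog = [(2, 3), (2, 7), (3, 11)]
title_starts = [(1, 1), (2, 1), (3, 1)]
title_ends = [(1, 4), (2, 4), (3, 4)]
print(in_title(dog, title_starts, title_ends))  # prints [2]
```

Because the cursors are shared across all hits, the total work is proportional to the combined length of the three index entries, however many pages they cover.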
Title queries and other “structure queries” that depend on the structure of a web page are similar to the NEAR queries discussed earlier, in that humans rarely employ structure queries, but search engines use them internally all the time. The reason is the same as before: search engines live or die by their rankings, and rankings can be significantly improved by exploiting the structure of web pages. For example, pages that have “dog” in the title are much more likely to contain information about dogs than pages that mention “dog” only in the body of the page. So when a user enters the simple query dog, a search engine could internally perform a dog IN TITLE search (even though the user did not explicitly request that) to find pages that are most likely to be about dogs, rather than just happening to mention dogs.
INDEXING AND MATCHING TRICKS ARE NOT THE WHOLE STORY
Building a web search engine is no easy task. The final product is like an enormously complex machine with many different wheels, gears, and levers, which must all be set correctly for the system to be useful. Therefore, it is important to realize that the two tricks presented in this chapter do not by themselves solve the problem of building an effective search engine index. However, the word-location trick and the metaword trick certainly convey the flavor of how real search engines construct and use indexes.
The metaword trick did help AltaVista succeed—where others had failed—in finding efficient matches to the entire web. We know this because the metaword trick is described in a 1999 U.S. patent filing by AltaVista, entitled “Constrained Searching of an Index.” However, AltaVista's superbly crafted matching algorithm was not enough to keep it afloat in the turbulent early days of the search industry. As we already know, efficient matching is only half the story for an effective search engine: the other grand challenge is to rank the matching pages. And as we will see in the next chapter, the emergence of a new type of ranking algorithm was enough to eclipse AltaVista, vaulting Google into the forefront of the world of web search.
3
PageRank: The Technology That Launched Google
—LARRY PAGE (Google cofounder)
Architecturally speaking, the garage is typically a humble entity. But in Silicon Valley, garages have a special entrepreneurial significance: many of the great Silicon Valley technology companies were born, or at least incubated, in a garage. This is not a trend that began in the dot-com boom of the 1990s. Over 50 years earlier, in 1939, with the world economy still reeling from the Great Depression, Hewlett-Packard got underway in Dave Packard's garage in Palo Alto, California. Several decades after that, in 1976, Steve Jobs and Steve Wozniak operated out of Jobs' garage in Los Altos, California, after founding their now-legendary Apple computer company. (Although popular lore has it that Apple was founded in the garage, Jobs and Wozniak actually worked out of a bedroom at first. They soon ran out of space and moved into the garage.) But perhaps even more remarkable than the HP and Apple success stories is the launch of a search engine called Google, which operated out of a garage in Menlo Park, California, when first incorporated as a company in September 1998.
By that time, Google had in fact already been running its web search service for well over a year—initially from servers at Stanford University, where both of the cofounders were Ph.D. students. It wasn't until the bandwidth requirements of the increasingly popular service became too much for Stanford that the two students, Larry Page and Sergey Brin, moved the operation into the now-famous Menlo Park garage. They must have been doing something right, because only three months after its legal incorporation as a company, Google was named by PC Magazine as one of the top 100 websites for 1998.
And here is where our story really begins: in the words of PC Magazine, Google's elite status was awarded for its “uncanny knack for returning extremely relevant results.” You may recall from the last chapter that the first commercial search engines had been launched four years earlier, in 1994. How could the garage-bound Google overcome this phenomenal four-year deficit, leapfrogging the already-popular Lycos and AltaVista in terms of search quality? There is no simple answer to this question. But one of the most important factors, especially in those early days, was the innovative algorithm used by Google for ranking its search results: an algorithm known as PageRank.
The name “PageRank” is a pun: it's an algorithm that ranks web pages, but it's also the ranking algorithm of Larry Page, its chief inventor. Page and Brin published the algorithm in 1998, in an academic conference paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine.” As its title suggests, this paper does much more than describe PageRank. It is, in fact, a complete description of the Google system as it existed in 1998. But buried in the technical details of the system is a description of what may well be the first algorithmic gem to emerge in the 21st century: the PageRank algorithm. In this chapter, we'll explore how and why this algorithm is able to find needles in haystacks, consistently delivering the most relevant results as the top hits to a search query.
THE HYPERLINK TRICK
You probably already know what a hyperlink is: it is a phrase on a web page that takes you to another web page when you click on it. Most web browsers display hyperlinks underlined in blue so that they stand out easily.
Hyperlinks are a surprisingly old idea. In 1945, around the same time that electronic computers themselves were first being developed, the American engineer Vannevar Bush published a visionary essay entitled “As We May Think.” In this wide-ranging essay, Bush described a slew of potential new technologies, including a machine he called the memex. A memex would store documents and automatically index them, but it would also do much more. It would allow “associative indexing, … whereby any item may be caused at will to select immediately and automatically another”; in other words, a rudimentary form of hyperlink!
The basis of the hyperlink trick. Six web pages are shown, each represented by a box. Two of the pages are scrambled egg recipes, and the other four are pages that have hyperlinks to these recipes. The hyperlink trick ranks Bert's page above Ernie's, because Bert has three incoming links and Ernie only has one.
Hyperlinks have come a long way since 1945. They are one of the most important tools used by search engines to perform ranking, and they are fundamental to Google's PageRank technology, which we'll now begin to explore in earnest.
The first step in understanding PageRank is a simple idea we'll call the hyperlink trick. This trick is most easily explained by an example. Suppose you are interested in learning how to make scrambled eggs and you do a web search on that topic. Now any real web search on scrambled eggs turns up millions of hits, but to keep things really simple, let's imagine that only two pages come up—one called “Ernie's scrambled egg recipe” and the other called “Bert's scrambled egg recipe.” These are shown in the figure above, together with some other web pages that have hyperlinks to either Bert's recipe or Ernie's. To keep things simple (again), let's imagine that the four pages shown are the only pages on the entire web that link to either of our two scrambled egg recipes. The hyperlinks are shown as underlined text, with arrows to show where the link goes to.
The question is, which of the two hits should be ranked higher, Bert or Ernie? As humans, it's not much trouble for us to read the pages that link to the two recipes and make a judgment call. It seems that both of the recipes are reasonable, but people are much more enthusiastic about Bert's recipe than Ernie's. So in the absence of any other information, it probably makes more sense to rank Bert above Ernie.
Unfortunately, computers are not good at understanding what a web page actually means, so it is not feasible for a search engine to examine the four pages linking to the hits and make an assessment of how strongly each recipe is recommended. On the other hand, computers are excellent at counting things. So one simple approach is to count the number of pages that link to each of the recipes (in this case, one for Ernie and three for Bert) and rank the recipes according to how many incoming links they have. Of course, this approach is not nearly as accurate as having a human read all the pages and determine a ranking manually, but it is nevertheless a useful technique. It turns out that, if you have no other information, the number of incoming links that a web page has can be a helpful indicator of how useful, or “authoritative,” the page is likely to be. In this case, the score is Bert 3, Ernie 1, so Bert's page gets ranked above Ernie's when the search engine's results are presented to the user.
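Counting incoming links takes only a few lines of code. The sketch below is illustrative: the names of the four linking pages are made up, and only the link targets (three links to Bert, one to Ernie) come from the example.

```python
from collections import Counter

# Each pair is (linking page, page it links to). The linking-page names are
# hypothetical; the targets follow the scrambled egg example.
links = [
    ("page1", "ernie"),
    ("page2", "bert"),
    ("page3", "bert"),
    ("page4", "bert"),
]


def rank_by_incoming_links(links, candidates):
    """Order the candidate pages by how many distinct pages link to them."""
    # set() ensures a page that links twice to the same target counts once.
    incoming = Counter(target for _source, target in set(links))
    return sorted(candidates, key=lambda page: incoming[page], reverse=True)


print(rank_by_incoming_links(links, ["ernie", "bert"]))  # prints ['bert', 'ernie']
```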
You can probably already see some problems with this “hyperlink trick” for ranking. One obvious issue is that sometimes links are used to indicate bad pages rather than good ones. For example, imagine a web page that linked to Ernie's recipe by saying, “I tried Ernie's recipe, and it was awful.” Links like this one, that criticize a page rather than recommend it, do indeed cause the hyperlink trick to rank pages more highly than they deserve. But it turns out that, in practice, hyperlinks are more often recommendations than criticisms, so the hyperlink trick remains useful despite this obvious flaw.
THE AUTHORITY TRICK
You may already be wondering why all the incoming links to a page should be treated equally. Surely a recommendation from an expert is worth more than one from a novice? To understand this in detail, we will stick with the scrambled eggs example from before, but with a different set of incoming links. The figure on the following page shows the new setup: Bert and Ernie each now have the same number of incoming links (just one), but Ernie's incoming link is from my own home page, whereas Bert's is from the famous chef Alice Waters.
If you had no other information, whose recipe would you prefer? Obviously, it's better to choose the one recommended by a famous chef, rather than the one recommended by the author of a book about computer science. This basic principle is what we'll call the “authority trick”: links from pages with high “authority” should result in a higher ranking than links from pages with low authority.
The basis for the authority trick. Four web pages are shown: two scrambled egg recipes and two pages that link to the recipes. One of the links is from the author of this book (who is not a famous chef) and one is from the home page of the famous chef Alice Waters. The authority trick ranks Bert's page above Ernie's, because Bert's incoming link has greater “authority” than Ernie's.
This principle is all well and good, but in its present form it is useless to search engines. How can a computer automatically determine that Alice Waters is a greater authority on scrambled eggs than me? Here is an idea that might help: let's combine the hyperlink trick with the authority trick. All pages start off with an authority score of 1, but if a page has some incoming links, its authority is calculated by adding up the authority of all the pages that point to it. In other words, if pages X and Y link to page Z, then the authority of Z is just the authority of X plus the authority of Y.
The figure on the next page gives a detailed example, calculating authority scores for the two scrambled egg recipes. The final scores are shown in circles. There are two pages that link to my home page; these pages have no incoming links themselves, so they get scores of 1. My home page gets the total score of all its incoming links, which adds up to 2. Alice Waters's home page has 100 incoming links that each have a score of 1, so she gets a score of 100. Ernie's recipe has only one incoming link, but it is from a page with a score of 2, so by adding up all the incoming scores (in this case there is only one number to add), Ernie gets a score of 2. Bert's recipe also has only one incoming link, valued at 100, so Bert's final score is 100. And because 100 is greater than 2, Bert's page gets ranked above Ernie's.
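The same calculation can be sketched in Python. The page names below are placeholders, and the function assumes the links contain no cycles; with a cycle, the recursion would never finish.

```python
from functools import lru_cache


def authority_scores(incoming):
    """Compute scores from a dict mapping each page to the pages linking to it.

    Assumes the links form no cycles.
    """
    @lru_cache(maxsize=None)
    def score(page):
        sources = incoming.get(page, [])
        if not sources:
            return 1  # a page with no incoming links keeps the starting score
        return sum(score(source) for source in sources)

    return {page: score(page) for page in incoming}


# Hypothetical page names; the link structure follows the example in the text.
incoming = {
    "author_home": ["linker1", "linker2"],
    "alice_home": [f"alice_linker{i}" for i in range(100)],
    "ernie": ["author_home"],
    "bert": ["alice_home"],
}
scores = authority_scores(incoming)
print(scores["ernie"], scores["bert"])  # prints 2 100
```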
A simple calculation of “authority scores” for the two scrambled egg recipes. The authority scores are shown in circles.
THE RANDOM SURFER TRICK
It seems like we have hit on a strategy for automatically calculating authority scores that really works, without any need for a computer to actually understand the content of a page. Unfortunately, there can be a major problem with the approach. It is quite possible for hyperlinks to form what computer scientists call a “cycle.” A cycle exists if you can get back to your starting point just by clicking on hyperlinks.
The figure on the following page gives an example. There are 5 web pages labeled A, B, C, D, and E. If we start at A, we can click through from A to B, and then from B to E—and from E we can click through to A, which is where we started. This means that A, B, and E form a cycle.
It turns out that our current definition of “authority score” (combining the hyperlink trick and the authority trick) gets into big trouble whenever there is a cycle. Let's see what happens on this particular example. Pages C and D have no incoming links, so they get a score of 1. C and D both link to A, so A gets the sum of C and D, which is 1 + 1 = 2. Then B gets the score 2 from A, and E gets 2 from B. (The situation so far is summarized by the left-hand panel of the figure above.) But now A is out of date: it still gets 1 each from C and D, but it also gets 2 from E, for a total of 4. But now B is out of date: it gets 4 from A. But then E needs updating, so it gets 4 from B. (Now we are at the right-hand panel of the figure above.) And so on: now A is 6, so B is 6, so E is 6, so A is 8,…. You get the idea, right? We have to go on forever with the scores always increasing as we go round the cycle.
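A short Python sketch makes the runaway growth visible. It follows the update order used above (A, then B, then E) and treats a page that has not been scored yet as contributing 0; both choices are assumptions made to match the walk-through.

```python
# Link structure of the five-page example: C and D link to A, A links to B,
# B links to E, and E links back to A.
incoming = {"A": ["C", "D", "E"], "B": ["A"], "E": ["B"]}

scores = {"C": 1, "D": 1}  # pages with no incoming links get a score of 1
history = []  # the score of A after each trip around the cycle

for _ in range(4):
    for page in ["A", "B", "E"]:
        # A page that has not been scored yet contributes nothing.
        scores[page] = sum(scores.get(source, 0) for source in incoming[page])
    history.append(scores["A"])

print(history)  # prints [2, 4, 6, 8]
```

Each trip around the cycle adds 2 to every score on it, so no number of repetitions ever produces a stable answer.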
An example of a cycle of hyperlinks. Pages A, B, and E form a cycle because you can start at A, click through to B, then E, and then return to your starting point at A.
The problem caused by cycles. A, B, and E are always out of date, and their scores keep growing forever.
Calculating authority scores this way creates a chicken-and-egg problem. If we knew the true authority score for A, we could compute the authority scores for B and E. And if we knew the true scores for B and E, we could compute the score for A. But because each depends on the other, it seems as if this would be impossible.
Fortunately, we can solve this chicken-and-egg problem using a technique we'll call the random surfer trick. Beware: the initial description of the random surfer trick bears no resemblance to the hyperlink and authority tricks discussed so far. Once we've covered the basic mechanics of the random surfer trick, we'll do some analysis to uncover its remarkable properties: it combines the desirable features of the hyperlink and authority tricks, but works even when cycles of hyperlinks are present.
The random surfer model. Pages visited by the surfer are darkly shaded, and the dashed arrows represent random restarts. The trail starts at page A and follows randomly selected hyperlinks interrupted by two random restarts.
The trick begins by imagining a person who is randomly surfing the internet. To be specific, our surfer starts off at a single web page selected at random from the entire World Wide Web. The surfer then examines all the hyperlinks on the page, picks one of them at random, and clicks on it. The new page is then examined and one of its hyperlinks is chosen at random. This process continues, with each new page being selected randomly by clicking a hyperlink on the previous page. The figure above shows an example, in which we imagine that the entire World Wide Web consists of just 16 web pages. Boxes represent the web pages, and arrows represent hyperlinks between the pages. Four of the pages are labeled for easy reference later. Web pages visited by the surfer are darkly shaded, hyperlinks clicked by the surfer are colored black, and the dashed arrows represent random restarts, which are described next.
There is one twist in the process: every time a page is visited, there is some fixed restart probability (say, 15%) that the surfer does not click on one of the available hyperlinks. Instead, he or she restarts the procedure by picking another page randomly from the whole web. It might help to think of the surfer having a 15% chance of getting bored by any given page, causing him or her to follow a new chain of links instead. To see some examples, take a closer look at the figure above. This particular surfer started at page A and followed three random hyperlinks before getting bored by page B and restarting on page C. Two more random hyperlinks were followed before the next restart. (By the way, all the random surfer examples in this chapter use a restart probability of 15%, which is the same value used by Google cofounders Page and Brin in the original paper describing their search engine prototype.)
It's easy to simulate this process by computer. I wrote a program to do just that and ran it until the surfer had visited 1000 pages. (Of course, this doesn't mean 1000 distinct pages. Multiple visits to the same page are counted, and in this small example, all of the pages were visited many times.) The results of the 1000 simulated visits are shown in the top panel of the figure on the next page. You can see that page D was the most frequently visited, with 144 visits.
Just as with public opinion polls, we can improve the accuracy of our simulation by increasing the number of random samples. I reran the simulation, this time waiting until the surfer had visited one million pages. (In case you're wondering, this takes less than half a second on my computer!) With such a large number of visits, it's preferable to present the results as percentages. This is what you can see in the bottom panel of the figure on the facing page. Again, page D was the most frequently visited, with 15% of the visits.
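A simulation of this kind might look like the Python sketch below. It is not the program used for the figures. Because the links of the 16-page web are given only in the figure, the sketch runs on the five-page example with a cycle described earlier (C and D link to A, A to B, B to E, and E back to A). It also assumes that a restart may land on any page, including the current one, and that a page with no outgoing links always triggers a restart.

```python
import random
from collections import Counter

# The five-page example with a cycle: A -> B -> E -> A, plus C -> A and D -> A.
LINKS = {"A": ["B"], "B": ["E"], "C": ["A"], "D": ["A"], "E": ["A"]}


def simulate_surfer(links, visits, restart_probability=0.15, seed=0):
    """Return the fraction of simulated visits spent on each page."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    pages = sorted(links)
    counts = Counter()
    page = rng.choice(pages)  # start at a page chosen at random
    for _ in range(visits):
        counts[page] += 1
        outgoing = links[page]
        # Assumption: a page with no outgoing links always causes a restart.
        if not outgoing or rng.random() < restart_probability:
            page = rng.choice(pages)  # the surfer gets bored and restarts
        else:
            page = rng.choice(outgoing)  # follow a randomly chosen hyperlink
    return {p: counts[p] / visits for p in pages}


scores = simulate_surfer(LINKS, visits=300_000)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, f"{scores[page]:.1%}")
```

On this small web, A should come out on top, followed by B and then E, with C and D far behind. Increasing `visits` tightens the estimates, just as a larger sample does for an opinion poll.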
What is the connection between our random surfer model and the authority trick that we would like to use for ranking web pages? The percentages calculated from random surfer simulations turn out to be exactly what we need to measure a page's authority. So let's define the surfer authority score of a web page to be the percentage of time that a random surfer would spend visiting that page. Remarkably, the surfer authority score incorporates both of our earlier tricks for ranking the importance of web pages. We'll examine these each in turn.
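The percentage of time spent on each page can also be computed without any randomness, by repeatedly pushing probabilities along the links until they settle. This is a different technique from the simulation described in the text, offered here only as a cross-check. The sketch uses the five-page example with a cycle from earlier in the chapter, and the function name, the iteration count, and the handling of pages without outgoing links are all assumptions.

```python
# The five-page example with a cycle: A -> B -> E -> A, plus C -> A and D -> A.
LINKS = {"A": ["B"], "B": ["E"], "C": ["A"], "D": ["A"], "E": ["A"]}


def surfer_scores(links, restart_probability=0.15, iterations=100):
    """Approximate the long-run fraction of time the surfer spends on each page."""
    pages = sorted(links)
    n = len(pages)
    prob = {p: 1.0 / n for p in pages}  # the surfer starts at a random page
    for _ in range(iterations):
        following = {p: 0.0 for p in pages}
        for p in pages:
            outgoing = links[p]
            if outgoing:
                restart_mass = restart_probability * prob[p]
                share = (1 - restart_probability) * prob[p] / len(outgoing)
                for target in outgoing:
                    following[target] += share
            else:
                # Assumption: with nowhere to click, the surfer always restarts.
                restart_mass = prob[p]
            for q in pages:
                following[q] += restart_mass / n  # a restart lands anywhere
        prob = following
    return prob


exact = surfer_scores(LINKS)
print({page: round(value, 3) for page, value in exact.items()})
```

With a 15% restart probability the values settle at roughly 33% for A, 31% for B, 30% for E, and 3% each for C and D, so the cycle that defeated the earlier authority calculation causes no trouble here.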
First, we had the hyperlink trick: the main idea here was that a page with many incoming links should receive a high ranking. This is also true in the random surfer model, because a page with many incoming links has many chances to be visited. Page D in the lower panel on the next page is a good example of this: it has five incoming links, more than any other page in the simulation, and ends up having the highest surfer authority score (15%).
Second, we had the authority trick. The main idea was that an incoming link from a highly authoritative page should improve a page's ranking more than an incoming link from a less authoritative page. Again, the random surfer model takes account of this. Why? Because an incoming link from a popular page will have more opportunities to be followed than a link from an unpopular page. To see an instance of this in our simulation example, compare pages A and C in the lower panel above: each has exactly one incoming link, but page A has a much higher surfer authority score (13% vs. 2%) because of the quality of its incoming link.
随机浏览者模拟。上图:1000 次访问模拟中每个页面的访问次数。下图:100 万次访问模拟中每个页面的访问百分比。
Random surfer simulations. Top: Number of visits to each page in a 1000-visit simulation. Bottom: Percentage of visits to each page in a simulation of one million visits.
第 29 页上的炒鸡蛋示例的浏览者权威分数。Bert 和 Ernie 各自都有一个赋予其页面权威的传入链接,但 Bert 的页面在网络搜索查询“炒鸡蛋”时排名更高。
Surfer authority scores for the scrambled egg example on page 29. Bert and Ernie each have exactly one incoming link conferring authority on their pages, but Bert's page will be ranked higher in a web search query for “scrambled eggs.”
请注意,随机浏览者模型同时结合了超链接技巧和权威性技巧。换句话说,每个页面的入站链接的质量和数量都会被考虑在内。页面B就体现了这一点:它获得了相对较高的分数(10%),是因为有三个入站链接来自分数中等的页面(分数在 4% 到 7% 之间)。
Notice that the random surfer model simultaneously incorporates both the hyperlink trick and authority trick. In other words, the quality and quantity of incoming links at each page are all taken into account. Page B demonstrates this: it receives its relatively high score (10%) due to three incoming links from pages with moderate scores, ranging from 4% to 7%.
随机浏览者技巧的妙处在于,与权威技巧不同,无论超链接是否存在循环,它都能完美地发挥作用。回到我们之前的炒鸡蛋示例(第29页),我们可以轻松地运行随机浏览者模拟。经过数百万次访问后,我自己的模拟得出了如上图所示的浏览者权威分数。请注意,与我们之前使用权威技巧的计算结果一样,Bert 的页面得分远高于 Ernie 的页面(28% vs. 1%)——尽管两者都只有一个入站链接。因此,Bert 在“炒鸡蛋”的网络搜索查询中排名会更高。
The beauty of the random surfer trick is that, unlike the authority trick, it works perfectly well whether or not there are cycles in the hyperlinks. Going back to our earlier scrambled egg example (page 29), we can easily run a random surfer simulation. After several million visits, my own simulation produced the surfer authority scores shown in the figure above. Notice that, as with our earlier calculation using the authority trick, Bert's page receives a much higher score than Ernie's (28% vs. 1%)—despite the fact that each has exactly one incoming link. So Bert would be ranked higher in a web search query for “scrambled eggs.”
现在让我们回到之前那个更难的例子:第 30 页上的图,由于超链接的循环,它给我们最初的权威技巧带来了一个难以克服的问题。同样,运行一个随机浏览者的计算机模拟很容易,并生成上图中的浏览者权威分数。通过此模拟确定的浏览者权威分数告诉我们搜索引擎在返回结果时将使用的最终排名:页面A最高,其次是B,然后是E,C和D并列最后。
Now let's turn to the more difficult example from earlier: the figure on page 30, which caused an insurmountable problem for our original authority trick because of the cycle of hyperlinks. Again, it's easy to run a computer simulation of random surfers, producing the surfer authority scores in the figure above. The surfer authority scores determined by this simulation tell us the final ranking that would be used by a search engine when returning results: page A is highest, followed by B, then E, with C and D sharing last place.
前面例子中,存在超链接循环(第30页),其冲浪者权威评分如下。尽管存在循环(A –> B –> E –> A),随机冲浪技巧仍能轻松计算出合适的评分。
Surfer authority scores for the earlier example with a cycle of hyperlinks (page 30). The random surfer trick has no trouble computing appropriate scores, despite the presence of a cycle (A → B → E → A).
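For readers who like to experiment, here is a minimal sketch of a random surfer simulation on a small graph containing the cycle A → B → E → A. Two things in it are assumptions rather than facts taken from this section: the links from C and D into A (chosen to be consistent with the ranking reported above) and the 15% restart probability. The function name `surfer_authority_scores` is also just illustrative.

```python
import random

# Assumed link structure: the text gives the cycle A -> B -> E -> A; the links
# from C and D into A are a guess that is consistent with the reported ranking.
LINKS = {
    "A": ["B"],
    "B": ["E"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A"],
}


def surfer_authority_scores(links, visits=200_000, restart_probability=0.15, seed=0):
    """Estimate the fraction of visits a random surfer makes to each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    counts = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(visits):
        counts[current] += 1
        outgoing = links[current]
        # Occasionally (or when there is no link to follow) restart at a random page.
        if not outgoing or rng.random() < restart_probability:
            current = rng.choice(pages)
        else:
            current = rng.choice(outgoing)
    return {page: counts[page] / visits for page in pages}


scores = surfer_authority_scores(LINKS)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.1%}")
```

The cycle causes no difficulty here: the surfer simply keeps walking, and the visit counts settle down regardless of how the links loop back on themselves.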
PAGERANK 实践
PAGERANK IN PRACTICE
谷歌联合创始人在他们1998年那篇如今已声名显赫的会议论文《大型超文本网络搜索引擎剖析》中描述了随机浏览技巧。该技巧的变体与许多其他技术相结合,至今仍被各大搜索引擎使用。然而,由于诸多复杂因素的存在,现代搜索引擎实际采用的技术与本文所述的随机浏览技巧略有不同。
The random surfer trick was described by Google's cofounders in their now-famous 1998 conference paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine.” In combination with many other techniques, variants of this trick are still used by the major search engines. There are, however, numerous complicating factors, which mean that the actual techniques employed by modern search engines differ somewhat from the random surfer trick described here.
其中一个复杂因素触及了 PageRank 的核心:超链接赋予合法权威性的假设有时值得怀疑。我们已经知道,尽管超链接可能代表批评而不是建议,但这在实践中往往不是一个大问题。一个更严重的问题是,人们会滥用超链接技巧来人为地提高自己网页的排名。假设您经营一个名为BooksBooksBooks.com的网站,该网站销售(令人惊讶的是)书籍。使用自动化技术,可以相对轻松地创建大量(比如 10,000 个)不同的网页,这些网页都包含指向BooksBooksBooks.com的链接。因此,如果搜索引擎完全按照此处所述计算 PageRank 权威性,BooksBooksBooks.com可能会不公平地获得比其他书店高出数千倍的分数,从而获得更高的排名,从而增加销量。
One of these complicating factors strikes at the heart of PageRank: the assumption that hyperlinks confer legitimate authority is sometimes questionable. We already learned that although hyperlinks can represent criticisms rather than recommendations, this tends not to be a significant problem in practice. A much more severe problem is that people can abuse the hyperlink trick to artificially inflate the ranking of their own web pages. Suppose you run a website called BooksBooksBooks.com that sells (surprise, surprise) books. Using automated technology, it's relatively easy to create a large number—say, 10,000—of different web pages that all have links to BooksBooksBooks.com. Thus, if search engines computed PageRank authorities exactly as described here, BooksBooksBooks.com might undeservedly get a score thousands of times higher than other bookstores, resulting in a high ranking and thus more sales.
搜索引擎将这种滥用行为称为网络垃圾。(该术语源于与电子邮件垃圾的类比:电子邮件收件箱中的垃圾信息类似于那些扰乱网络搜索结果的垃圾网页。)检测和清除各种类型的网络垃圾是所有搜索引擎持续进行的重要任务。例如,2004 年,微软的一些研究人员发现,超过 30 万个网站,却恰好有1001 个页面链接到它们——这种情况非常可疑。通过手动检查这些网站,研究人员发现,这些传入的超链接绝大多数都是网络垃圾。
Search engines call this kind of abuse web spam. (The terminology comes from an analogy with e-mail spam: unwanted messages in your e-mail inbox are similar to unwanted web pages that clutter the results of a web search.) Detecting and eliminating various types of web spam are important ongoing tasks for all search engines. For example, in 2004, some researchers at Microsoft found over 300,000 websites that had exactly 1001 pages linking to them—a very suspicious state of affairs. By inspecting these websites manually, the researchers found that the vast majority of these incoming hyperlinks were web spam.
因此,搜索引擎正与网络垃圾制造者展开一场“军备竞赛”,并不断尝试改进其算法,以提供更合理的排名。这种提高 PageRank 排名的动力催生了大量学术界和工业界的研究,他们研究利用网络超链接结构对网页进行排名的其他算法。这类算法通常被称为基于链接的排名算法。
Hence, search engines are engaged in an arms race against web spammers and are constantly trying to improve their algorithms in order to return realistic rankings. This drive to improve PageRank has spawned a great deal of academic and industrial research into other algorithms that use the hyperlink structure of the web for ranking pages. Algorithms of this kind are often referred to as link-based ranking algorithms.
另一个复杂因素与 PageRank 计算的效率有关。我们的浏览者权威分数是通过运行随机模拟计算得出的,但在整个网络上运行此类模拟耗时过长,难以实际应用。因此,搜索引擎不会通过模拟随机浏览者来计算其 PageRank 值:它们使用数学技术,得出的结果与我们自己的随机浏览者模拟相同,但计算成本却低得多。我们研究浏览者模拟技术,是因为它直观易懂,而且它描述的是搜索引擎计算的内容,而不是计算方式。
Another complicating factor relates to the efficiency of PageRank computations. Our surfer authority scores were computed by running random simulations, but running a simulation of that kind on the entire web would take far too long to be of practical use. So search engines do not compute their PageRank values by simulating random surfers: they use mathematical techniques that give the same answers as our own random surfer simulations, but with far less computational expense. We studied the surfer-simulation technique because of its intuitive appeal, and because it describes what the search engines calculate, not how they calculate it.
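To give a feel for what such a mathematical technique might look like, here is a sketch of one standard approach, often called power iteration: instead of following a single surfer, we repeatedly redistribute every page's score along its links until the scores stop changing. This section does not say which techniques search engines actually use, so treat this purely as an illustration. The graph, the 15% restart probability, and the name `pagerank_by_iteration` are all assumptions.

```python
# Assumed example graph: a cycle A -> B -> E -> A, with C and D linking to A.
LINKS = {
    "A": ["B"],
    "B": ["E"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A"],
}


def pagerank_by_iteration(links, restart_probability=0.15, iterations=100):
    """Compute surfer authority scores without simulating any surfer."""
    pages = sorted(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_scores = {page: 0.0 for page in pages}
        for page in pages:
            outgoing = links[page]
            if outgoing:
                # Part of this page's score follows its links...
                follow_share = (1 - restart_probability) * scores[page] / len(outgoing)
                for target in outgoing:
                    new_scores[target] += follow_share
                restart_share = restart_probability * scores[page]
            else:
                # ...and a page with no links hands everything to the restart.
                restart_share = scores[page]
            # The restart share is spread evenly over all pages.
            for target in pages:
                new_scores[target] += restart_share / n
        scores = new_scores
    return scores


scores = pagerank_by_iteration(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page}: {scores[page]:.1%}")
```

Each pass touches every link once, so the cost grows with the number of links rather than with the millions of simulated visits a random walk would need to reach similar precision.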
值得注意的是,商业搜索引擎确定其排名的算法远不止基于链接的 PageRank 等排名算法。早在 1998 年,谷歌的联合创始人在最初发布的谷歌描述中就提到了其他一些有助于提高搜索结果排名的功能。正如您所料,这项技术从那时起就不断发展:在撰写本文时,谷歌官网上声明其使用了“超过 200 个信号”来评估页面的重要性。
It's also worth noting that commercial search engines determine their rankings using a lot more than just a link-based ranking algorithm like PageRank. Even in their original, published description of Google back in 1998, Google's cofounders mentioned several other features that contributed to the ranking of search results. As you might expect, the technology has moved on from there: at the time of writing, Google's own website states that “more than 200 signals” are used in assessing the importance of a page.
尽管现代搜索引擎错综复杂,但 PageRank 的核心理念——权威页面可以通过超链接赋予其他页面权威性——依然有效。正是这一理念帮助谷歌击败了 AltaVista,并在短短几年内从一家小型初创公司一跃成为搜索之王。如果没有 PageRank 的核心理念,大多数网络搜索查询都会淹没在成千上万个匹配但不相关的网页的汪洋大海中。PageRank 确实是一颗算法瑰宝,它能让一根针毫不费力地从大海捞针中升起。
Despite the many complexities of modern search engines, the beautiful idea at the heart of PageRank—that authoritative pages can confer authority on other pages via hyperlinks—remains valid. It was this idea that helped Google to dethrone AltaVista, transforming Google from small startup to king of search in a few heady years. Without the core idea of PageRank, most web search queries would drown in a sea of thousands of matching, but irrelevant, web pages. PageRank is indeed an algorithmic gem that allows a needle to rise effortlessly to the top of its haystack.
4
4
公钥密码学:通过明信片发送秘密
Public Key Cryptography: Sending Secrets on a Postcard
——BOB DYLAN,《圣约女人》
—BOB DYLAN, Covenant Woman
人类喜欢八卦,也喜欢秘密。由于密码学的目标是传递秘密,我们都是天生的密码学家。但人类比计算机更容易进行秘密交流。如果你想把秘密告诉朋友,你可以悄悄地在他耳边说。而计算机做到这一点并不容易。一台计算机根本无法将信用卡号“悄悄”地告诉另一台计算机。如果计算机通过互联网连接,它们就无法控制信用卡号的去向,也无法控制哪些其他计算机能够获取到它。在本章中,我们将了解计算机如何解决这个问题,并使用有史以来最巧妙、最具影响力的计算机科学思想之一:公钥密码学。
Humans love to gossip, and they love secrets. And since the goal of cryptography is to communicate secrets, we are all natural cryptographers. But humans can communicate secretly more easily than computers. If you want to tell a secret to your friend, you can just whisper in your friend's ear. It's not so easy for computers to do that. There's no way for one computer to “whisper” a credit card number to another computer. If the computers are connected by the internet, they have no control over where that credit card number goes, and which other computers get to find it out. In this chapter we'll find out how computers get around this problem, using one of the most ingenious and influential computer science ideas of all time: public key cryptography.
看到这里,你可能会好奇,为什么本章的标题会提到“在明信片上传递秘密”。下一页的图表揭示了答案:用明信片进行通信可以作为类比,来展示公钥密码学的威力。在现实生活中,如果你想给某人发送一份机密文件,你当然会在发送前将文件放入安全密封的信封中。这并不能保证保密性,但却是朝着正确方向迈出的明智一步。另一方面,如果你选择在发送明信片之前将机密信息写在明信片背面,那么保密性显然会被侵犯:任何接触明信片的人(例如邮局工作人员)都可以直接查看明信片并阅读信息。
At this point, you may be wondering why the title of this chapter refers to “sending secrets on a postcard.” The figure on the facing page reveals the answer: communicating via postcards can be used as an analogy to demonstrate the power of public key cryptography. In real life, if you wanted to send a confidential document to someone, you would, of course, enclose the document in a securely sealed envelope before sending it. This doesn't guarantee confidentiality, but it is a sensible step in the right direction. If, on the other hand, you choose to write your confidential message on the back of a postcard before sending it, confidentiality is obviously violated: anyone who comes in contact with the postcard (postal workers, for example) can just look at the postcard and read the message.
这正是计算机在互联网上进行保密通信时面临的问题。由于互联网上的任何消息通常都要经过无数台被称为路由器的计算机,因此任何能够访问路由器的人都可以看到消息的内容——这其中就包括潜在的恶意窃听者。因此,从你的电脑进入互联网的每一条数据,都像是写在明信片上一样!
This is precisely the problem that computers face when trying to communicate confidentially with each other on the internet. Because any message on the internet typically travels through numerous computers called routers, the contents of the message can be seen by anyone with access to the routers—and this includes potentially malicious eavesdroppers. Thus, every single piece of data that leaves your computer and enters the internet might as well be written on a postcard!
明信片的类比:显然,通过邮寄系统寄送明信片并不能保证内容的保密性。出于同样的原因,如果信用卡号没有经过适当的加密,从你的笔记本电脑发送到亚马逊网站,很容易被窃听者窃取。
The postcard analogy: It's obvious that sending a postcard through the mail system will not keep the contents of the postcard secret. For the same reason, a credit card number sent from your laptop to Amazon.com can easily be snooped by an eavesdropper if it is not properly encrypted.
你可能已经想到了一个快速解决明信片问题的方法。为什么我们不在写明信片之前用密码加密每条信息呢?实际上,如果你已经知道明信片的寄件人是谁,这种方法就很有效。这是因为你们可能在过去的某个时候就约定好了要使用什么密码。真正的问题是当你把明信片寄给一个你不认识的人时。如果你在明信片上使用密码,邮局工作人员将无法阅读你的信息,但收件人也一样!公钥加密的真正威力在于,它允许你使用只有收件人才能解密的密码——尽管你们根本没有机会秘密地约定使用什么密码。
You may have already thought of a quick fix for this postcard problem. Why don't we just use a secret code to encrypt each message before writing it on a postcard? Actually, this works just fine if you already know the person to whom you're sending the postcard. This is because you could have agreed, at some time in the past, on what secret code to use. The real problem is when you send a postcard to someone you don't know. If you use a secret code on this postcard, the postal workers will not be able to read your message, but neither will the intended recipient! The real power of public key cryptography is that it allows you to employ a secret code that only the recipient can decrypt—despite the fact that you had no chance to secretly agree on what code to use.
请注意,计算机在与“不认识”的接收者通信时也面临同样的问题。例如,您第一次使用信用卡在Amazon.com购物时,您的计算机必须将您的信用卡号传输到亚马逊的服务器计算机。但您的计算机之前从未与亚马逊服务器通信过,因此这两台计算机过去从未有机会就密码达成一致。而且,它们试图达成的任何协议都可能被它们之间互联网路由上的所有路由器观察到。
Note that computers face the same problem of communicating with recipients they don't “know.” For example, the first time you purchase something from Amazon.com using your credit card, your computer must transmit your credit card number to a server computer at Amazon. But your computer has never communicated before with the Amazon server, so there has been no opportunity in the past for these two computers to agree on a secret code. And any agreement they try to make can be observed by all routers on the internet route between them.
让我们回到明信片的类比。诚然,这种情况听起来像是一个悖论:收件人会看到与邮递员完全相同的信息,但收件人会以某种方式学习如何解码信息,而邮递员却不会。公钥密码学为这一悖论提供了解决方案。本章将解释如何实现这一解决方案。
Let's switch back to the postcard analogy. Admittedly, the situation sounds like a paradox: the recipient will see exactly the same information as the postal workers, but somehow the recipient will learn how to decode the message, whereas the postal workers will not. Public key cryptography provides a resolution to this paradox. This chapter explains how.
使用共享秘密加密
ENCRYPTING WITH A SHARED SECRET
让我们从一个非常简单的思维实验开始。我们先不考虑明信片的类比,而是更简单的一个:房间里的口头交流。具体来说,你和你的朋友阿诺德以及你的敌人伊芙待在一个房间里。你想秘密地向阿诺德传达一条信息,但伊芙无法理解。这条信息可能是一张信用卡卡号,但为了简单起见,我们假设它是一个非常短的信用卡卡号——只有1到9之间的一位数字。此外,你唯一可以与阿诺德交流的方式是大声说话,这样伊芙才能听到。任何偷偷摸摸的伎俩,比如低声耳语或递给他一张手写便条,都是不允许的。
Let's start with a very simple thought experiment. We'll abandon the postcard analogy for something even simpler: verbal communication in a room. Specifically, you're in a room with your friend Arnold and your enemy Eve. You want to secretly communicate a message to Arnold, without Eve being able to understand the message. Maybe the message is a credit card number, but let's keep things simple and suppose that it's an incredibly short credit card number—just a single digit between 1 and 9. Also, the only way you're allowed to communicate with Arnold is by speaking out loud so that Eve can overhear. No sneaky tricks, like whispering or passing him a handwritten note, are permitted.
具体来说,假设您要传达的信用卡号是数字 7。您可以这样处理。首先,试着想出一个 Arnold 知道但 Eve 不知道的数字。比如说,假设您和 Arnold 是老朋友了,小时候住在同一条街上。事实上,假设你们经常在位于普莱森特街 322 号的您家前院玩耍。此外,假设 Eve 小时候不认识您,特别是她不知道您和 Arnold 曾经玩耍的这所房子的地址。那么您可以对 Arnold 说:“嘿 Arnold,还记得我们小时候经常玩耍的位于普莱森特街的我家房子的号码吗?好吧,如果您把这个门牌号加上我现在想到的 1 位信用卡号,您会得到 329。”
To be specific, let's assume the credit card number you're trying to communicate is the number 7. Here's one way you could go about it. First, try to think of some number that Arnold knows but Eve doesn't. Let's say, for example, that you and Arnold are very old friends and lived on the same street as children. In fact, suppose you both often played in the front yard of your family's house at 322 Pleasant Street. Also, suppose that Eve didn't know you as a child and, in particular, she doesn't know the address of this house where you and Arnold used to play. Then you can say to Arnold: “Hey Arnold, remember the number of my family's house on Pleasant Street where we used to play as children? Well, if you take that house number, and add on the 1-digit credit card number I'm thinking of right now, you get 329.”
加法技巧:消息 7 通过将共享密钥 322 添加到加密中来加密。Arnold 可以通过减去共享密钥来解密,但 Eve 不能。
The addition trick: The message 7 is encrypted by adding it to the shared secret, 322. Arnold can decrypt it by subtracting the shared secret, but Eve cannot.
现在,只要阿诺德记得门牌号,他就能用你告诉他的总数 329 减去门牌号,算出信用卡号。他用 329 减 322,得到 7,这就是你试图告诉他的信用卡号。与此同时,伊芙却不知道信用卡号是多少,尽管她听到了你对阿诺德说的每一个字。上图演示了整个过程。
Now, as long as Arnold remembers the house number correctly, he can work out the credit card number by subtracting off the house number from the total you told him: 329. He calculates 329 minus 322 and gets 7, which is the credit card number you were trying to communicate to him. Meanwhile, Eve has no idea what the credit card number is, despite the fact that she heard every word you said to Arnold. The figure above demonstrates the whole process.
为什么这种方法有效?因为你和阿诺德拥有一个被计算机科学家称为共享秘密的东西:数字 322。因为你们都知道这个数字,而伊芙不知道,所以你们可以用这个共享秘密秘密地传递任何你们想要传递的数字,只需加上这个数字,宣布总数,然后让对方减去这个共享秘密即可。听到总数对伊芙来说毫无意义,因为她不知道该从中减去什么。
Why does this method work? Well, you and Arnold have a thing that computer scientists call a shared secret: the number 322. Because you both know this number, but Eve doesn't, you can use the shared secret to secretly communicate any other number you want, just by adding it on, announcing the total, and letting the other party subtract the shared secret. Hearing the total is of no use to Eve, because she doesn't know what number to subtract from it.
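The addition trick is simple enough to write down in a few lines. The sketch below uses the numbers from the example (message 7, shared secret 322); the function names `encrypt` and `decrypt` are just illustrative labels.

```python
SHARED_SECRET = 322  # the house number known to you and Arnold, but not to Eve


def encrypt(message: int, shared_secret: int) -> int:
    """Add the shared secret to the message; the total can be said out loud."""
    return message + shared_secret


def decrypt(total: int, shared_secret: int) -> int:
    """Subtract the shared secret from the announced total."""
    return total - shared_secret


announced_total = encrypt(7, SHARED_SECRET)
recovered = decrypt(announced_total, SHARED_SECRET)
print(announced_total, recovered)  # 329 7
```

Eve hears only `announced_total`. Without `SHARED_SECRET` she has nothing to subtract.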
信不信由你,如果你理解了这个简单的“加法技巧”,即在私人信息(比如信用卡号)中添加一个共享密钥,那么你已经明白了互联网上绝大多数加密技术的实际工作原理!计算机一直在使用这个技巧,但为了真正确保安全,还有一些细节需要注意。
Believe it or not, if you understood this simple “addition trick” of adding a shared secret to a private message like a credit card number, then you already understand how the vast majority of encryption on the internet actually works! Computers are constantly using this trick, but for it to be truly secure there are a few more details that need to be taken care of.
首先,计算机使用的共享密钥必须比门牌号 322 长得多。如果密钥太短,任何窃听对话的人都可以尝试所有可能性。例如,假设我们用一个 3 位数的门牌号,通过加法技巧加密一个真实的16 位数信用卡号。
First, the shared secrets that computers use need to be much longer than the house number 322. If the secret is too short, anyone eavesdropping on the conversation can just try out all the possibilities. For example, suppose we used a 3-digit house number to encrypt a real 16-digit credit card number using the addition trick.
请注意,三位数门牌号有 999 个可能,所以像 Eve 这样偷听到我们谈话的对手可以算出 999 个可能的数字,其中一个肯定是信用卡号。计算机几乎不需要任何时间就能尝试出 999 个信用卡号,所以如果要让共享密钥有用,我们需要使用远多于 3 位的数字。
Note that there are 999 possible 3-digit house numbers, so an adversary like Eve who overheard our conversation could work out a list of 999 possible numbers, of which one must be the credit card number. It would take a computer almost no time at all to try out 999 credit card numbers, so we need to use a lot more than 3 digits in a shared secret if it is going to be useful.
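Here is a small sketch of Eve's attack on a short shared secret. The 16-digit card number is made up for illustration, and the house numbers are counted from 1 to 999 to match the figure of 999 used above.

```python
card_number = 4000_1234_5678_9010  # made-up 16-digit number, for illustration only
house_number = 322                 # the short shared secret
overheard_total = card_number + house_number  # this is all that Eve hears

# Eve subtracts every possible house number from the total she overheard.
candidates = [overheard_total - guess for guess in range(1, 1000)]

print(len(candidates))            # 999
print(card_number in candidates)  # True
```

A list of 999 candidates is tiny by computer standards, which is why a 3-digit secret offers essentially no protection.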
事实上,当你听说某种加密有一定位数时,比如“128 位加密”,这实际上是在描述共享密钥的长度。共享密钥通常被称为“密钥”,因为它可以用来解锁或“解密”消息。如果你计算出密钥中位数的 30%,就等于密钥中大约有多少位数字。因为 128 的 30% 约为 38,所以我们知道 128 位加密使用的密钥是 38 位数字。1 38位数字大于十亿亿亿亿,而且由于任何已知的计算机都需要数十亿年才能尝试这么多的可能性,因此 38 位的共享密钥被认为是非常安全的。
In fact, when you hear about a type of encryption being a certain number of bits, as in the phrase “128-bit encryption,” this is actually a description of how long the shared secret is. The shared secret is often called a “key,” since it can be used to unlock, or “decrypt,” a message. If you work out 30% of the number of bits in the key, that tells you the approximate number of digits in the key. So because 30% of 128 is about 38, we know that 128-bit encryption uses a key that is a 38-digit number.¹ A 38-digit number is bigger than a billion billion billion billion, and because it would take any known computer billions of years to try out that many possibilities, a shared secret of 38 digits is considered to be very secure.
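The 30% rule of thumb comes from the fact that each bit is worth about 0.301 decimal digits (the base-10 logarithm of 2). The short sketch below checks the arithmetic; the function name is just an illustrative label.

```python
import math


def approximate_decimal_digits(bits: int) -> float:
    """Each bit contributes log10(2), roughly 0.301, decimal digits."""
    return bits * math.log10(2)


print(round(approximate_decimal_digits(128), 1))  # 38.5
print(len(str(2**128 - 1)))                       # 39
```

So "about 38 digits" is a slight rounding down: the largest possible 128-bit key has 39 digits, which only strengthens the conclusion that trying every key is hopeless.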
还有一个问题使得简单版的加法技巧在现实生活中无法奏效:加法运算的结果可以被统计分析,这意味着有人可以通过分析大量加密信息来破解你的密钥。因此,现代加密技术,即所谓的“分组密码”,使用的是加法技巧的变体。
There's one more wrinkle that prevents the simple version of the addition trick from working in real life: the addition produces results that can be analyzed statistically, meaning that someone could work out your key based on analyzing a large number of your encrypted messages. Instead, modern encryption techniques, called “block ciphers,” use a variant of the addition trick.
首先,长消息被拆分成固定大小的小“块”,通常约为 10-15 个字符。其次,并非简单地将消息块和密钥相加,而是根据一组固定的规则对每个块进行多次变换。这些规则类似于加法,但会使消息和密钥的混合程度更高。例如,规则可以是“将密钥的前半部分添加到块的后半部分,反转结果,然后将密钥的后半部分添加到块的后半部分”——尽管实际上规则要复杂得多。现代分组密码通常使用 10 轮或更多轮这样的操作,这意味着操作列表会重复应用。经过足够多的轮次后,原始消息就被完全混合,可以抵御统计攻击,但任何知道密钥的人都可以反向运行所有操作以获取原始的解密消息。
First, long messages are broken up into small “blocks” of a fixed size, typically around 10-15 characters. Second, rather than simply adding a block of the message and the key together, each block is transformed several times according to a fixed set of rules that are similar to addition but cause the message and the key to be mixed up more aggressively. For example, the rules could say something like “add the first half of the key to the last half of the block, reverse the result and add the second half of the key to the last half of the block”—although in reality the rules are quite a bit more complicated. Modern block ciphers typically use 10 or more “rounds” of these operations, meaning the list of operations is applied repeatedly. After a sufficient number of rounds, the original message is well and truly mixed up and will resist statistical attacks, but anyone who knows the key can run all the operations in reverse to obtain the original, decrypted message.
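To make the idea of "rounds" concrete, here is a toy sketch that follows the made-up rule quoted above, applied for 10 rounds and then undone in reverse. It is emphatically not a real block cipher: the rule is far too simple to resist statistical attacks, and the function names, the 16-character block, and the key are all illustrative assumptions.

```python
def encrypt_block(block: bytes, key: bytes, rounds: int = 10) -> bytes:
    """Toy rounds: add key halves to the last half of the block, reversing in between."""
    if len(block) != len(key) or len(block) % 2:
        raise ValueError("block and key must have the same even length")
    half = len(block) // 2
    first_key_half, second_key_half = key[:half], key[half:]
    data = list(block)
    for _ in range(rounds):
        for i in range(half):  # add first half of key to last half of block
            data[half + i] = (data[half + i] + first_key_half[i]) % 256
        data.reverse()         # reverse the result
        for i in range(half):  # add second half of key to last half of block
            data[half + i] = (data[half + i] + second_key_half[i]) % 256
    return bytes(data)


def decrypt_block(block: bytes, key: bytes, rounds: int = 10) -> bytes:
    """Run the same operations backwards, subtracting instead of adding."""
    half = len(block) // 2
    first_key_half, second_key_half = key[:half], key[half:]
    data = list(block)
    for _ in range(rounds):
        for i in range(half):
            data[half + i] = (data[half + i] - second_key_half[i]) % 256
        data.reverse()
        for i in range(half):
            data[half + i] = (data[half + i] - first_key_half[i]) % 256
    return bytes(data)


message = b"4000123456789010"  # one made-up 16-character block
key = b"0123456789abcdef"      # made-up 16-character key
scrambled = encrypt_block(message, key)
print(scrambled != message)
print(decrypt_block(scrambled, key) == message)
```

The point of the sketch is only the shape of the process: every step is reversible by anyone holding the key, so decryption is just the same list of operations run in the opposite order.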
在撰写本文时,最流行的分组密码是高级加密标准 (AES)。AES 可以与多种不同的设置一起使用,但典型的应用可能使用 16 个字符的分组、128 位密钥和 10 轮混合操作。
At the time of writing, the most popular block cipher is the Advanced Encryption Standard, or AES. AES can be used with a variety of different settings, but a typical application might use blocks of 16 characters, with 128-bit keys, and 10 rounds of mixing operations.
公开建立共享秘密
ESTABLISHING A SHARED SECRET IN PUBLIC
到目前为止,一切顺利。我们已经了解了互联网上绝大多数加密技术的实际工作原理:将消息切分成多个块,然后使用加法技巧的变体对每个块进行加密。但事实证明,这其实很容易。难点在于首先建立共享秘密。在上面的例子中,你和阿诺德、伊芙待在一个房间里,我们实际上作弊了一点——我们利用了你和阿诺德小时候是玩伴的事实,因此已经知道一个伊芙不可能知道的共享秘密(你们家的门牌号)。如果你、阿诺德和伊芙都是陌生人,而我们试图玩同一个游戏,会怎么样?有没有什么方法可以让你和阿诺德在伊芙不知情的情况下建立共享秘密?(记住,不许作弊——你不能悄悄地对阿诺德说任何话,也不能给他一张伊芙看不到的纸条。所有通信都必须公开。)
So far, so good. We've already found out how the vast majority of encryption on the internet actually works: chop the message up into blocks and use a variant of the addition trick to encrypt each block. But it turns out that this is the easy part. The hard part is establishing a shared secret in the first place. In the example given above, where you were in a room with Arnold and Eve, we actually cheated a bit—we used the fact that you and Arnold had been playmates as children and therefore already knew a shared secret (your family's house number) that Eve couldn't possibly know. What if you, Arnold, and Eve were all strangers, and we tried to play the same game? Is there any way that you and Arnold can set up a shared secret without Eve also knowing it? (Remember, no cheating—you can't whisper anything to Arnold or give him a note that Eve can't see. All communication must be public.)
这乍一看似乎不可能,但事实证明,有一个巧妙的方法可以解决这个问题。计算机科学家将这个解决方案称为“迪菲-赫尔曼密钥交换”,但我们称之为“颜料混合技巧”。
At first this might seem impossible, but it turns out that there is an ingenious way of solving the problem. Computer scientists call the solution Diffie-Hellman key exchange, but we're going to call it the paint-mixing trick.
调色技巧
The Paint-Mixing Trick
为了理解这个技巧,我们先暂时忘记交流信用卡号码,而是想象你想要分享的秘密是一种特定颜色的油漆。(是的,这有点奇怪,但我们很快就会发现,这也是思考问题的一种非常有用的方式。)现在假设你和阿诺德、伊芙在一个房间里,你们每个人都收藏了大量不同颜色的油漆罐。你们每个人都可以选择相同的颜色——有很多不同的颜色可供选择,每个人每种颜色都有许多罐。所以油漆用完不是问题。每个罐子都清楚地标有颜色,所以很容易给别人提供如何混合各种颜色的具体指示:你只需说“将一罐‘天蓝色’与六罐‘蛋壳色’和五罐‘海蓝宝石色’混合”。但每一种你能想到的色调都有成百上千种颜色,所以单凭肉眼观察不可能确定混合液中究竟包含哪些颜色。而且,你也不可能通过反复试验来确定混合液中到底包含哪些颜色,因为要尝试的颜色实在太多了。
To understand the trick, we're going to forget about communicating credit card numbers for a while, and instead imagine that the secret you would like to share is a particular color of paint. (Yes, this is a little weird, but as we'll soon see, it's also a very useful way of thinking about the problem.) So now suppose that you are in a room with Arnold and Eve and each of you has a huge collection of various pots of paint. You are each given the same choice of colors: there are many different colors available, and each of you has many pots of each color. So running out of paint is not going to be a problem. Each pot is clearly labeled with its color, so it's easy to give specific instructions to someone else about how to mix various colors together: you just say something like “mix one pot of ‘sky blue’ with six pots of ‘eggshell’ and five pots of ‘aquamarine’.” But there are hundreds or thousands of colors of every conceivable shade, so it's impossible to work out which exact colors went into a mixture just by looking at it. And it's impossible to work out which colors went into a mixture by trial and error, because there are just too many colors to try.
现在,游戏规则稍有变化。你们每个人都要在房间里用帘子隔开一个角落,用来存放你们的颜料,也可以在那里偷偷调配颜料而不被别人发现。但关于沟通的规则和以前一样:你、阿诺德和伊芙之间的任何沟通都必须公开进行。你们不能邀请阿诺德进入你们的私人调配区!另一条规则规定了如何分享调配好的颜料。你们可以将一批颜料交给房间里的其他人,但只能将这批颜料放在房间中央的地上,等待其他人来取。这意味着你永远无法确定谁会取走你的那批颜料。最好的办法是准备足够每个人使用的颜料,并在房间中央分别放置几批。这样,任何想要你的颜料的人都可以得到。这条规则实际上只是所有交流必须公开这一事实的延伸:如果你给了阿诺德某种混合物,但没有给夏娃,你就与阿诺德进行了某种“私人”交流,这是违反规则的。
Now, the rules of the game are going to change just a little bit. Each of you is going to have a corner of the room curtained off for privacy, as a place where you store your paint collection and where you can go to secretly mix paints without the others seeing. But the rules about communication are just the same as before: any communication between you, Arnold, and Eve must be out in the open. You can't invite Arnold into your private mixing area! Another rule regulates how you can share mixtures of paint. You can give a batch of paint to one of the other people in the room, but only by placing that batch on the ground in the middle of the room and waiting for someone else to pick it up. This means that you can never be sure who is going to pick up your batch of paint. The best way is to make enough for everybody, and leave several separate batches in the middle of the room. That way anyone who wants one of your batches can get it. This rule is really just an extension of the fact that all communication must be public: if you give a certain mixture to Arnold without giving it to Eve too, you have had some kind of “private” communication with Arnold, which is against the rules.
记住,这个调颜料游戏的目的是解释如何建立共享秘密。现在你或许会好奇,调颜料和密码学到底有什么关系,但别担心。你即将学习一个神奇的技巧,它实际上是计算机用来在互联网这样的公共场所建立共享秘密的!
Remember that this paint-mixing game is meant to explain how to establish a shared secret. At this point you may well be wondering what on earth mixing paints has got to do with cryptography, but don't worry. You are about to learn an amazing trick that is actually used by computers to establish shared secrets in a public place like the internet!
首先,我们需要了解游戏的目标。目标是你和阿诺德各自调配出相同的颜料,但不要告诉伊芙如何调配。如果你们达成了目标,我们就说你和阿诺德建立了一个“共享的秘密调配”。你们可以随意公开交谈,也可以在房间中央和你们的私人调配区之间来回搬运颜料罐。
First, we need to know the objective of the game. The objective is for you and Arnold to each produce the same mixture of paint, without telling Eve how to produce it. If you achieve this, we'll say that you and Arnold have established a “shared secret mixture.” You are allowed to have as much public conversation as you like, and you are also allowed to carry pots of paint back and forth between the middle of the room and your private mixing area.
现在,我们开始探索公钥密码学背后的精妙理念。我们的颜料混合技巧将分为四个步骤。
Now we begin our journey into the ingenious ideas behind public key cryptography. Our paint-mixing trick will be broken down into four steps.
步骤 1.您和 Arnold 各自选择一种“私人颜色”。
Step 1. You and Arnold each choose a “private color.”
你的私人颜色与你最终生成的共享秘密混合物不同,但它是共享秘密混合物的成分之一。你可以选择任何颜色作为你的私人颜色,但你必须记住它。显然,你的私人颜色几乎肯定会与阿诺德的不同,因为可供选择的颜色太多了。举个例子,假设你的私人颜色是淡紫色,而阿诺德的颜色是深红色。
Your private color is not the same thing as the shared secret mixture that you will eventually produce, but it will be one of the ingredients in the shared secret mixture. You can choose any color you want as your private color, but you have to remember it. Obviously, your private color will almost certainly be different from Arnold's, since there are so many colors to choose from. As an example, let's say your private color is lavender and Arnold's is crimson.
第 2 步。你们中的一个人公开宣布一种新的、不同的颜色的成分,我们称之为“公共颜色”。
Step 2. One of you publicly announces the ingredients of a new, different color that we'll call the “public color.”
再次强调,您可以选择任何您喜欢的颜色。假设您宣布公开颜色是雏菊黄。请注意,公开颜色只有一种(您和 Arnold 不会使用两种不同的颜色),而且 Eve 当然知道公开颜色是什么,因为您已经公开宣布了它。
Again, you can choose anything you like. Let's say you announce that the public color is daisy-yellow. Note that there is only one public color (not two separate ones for you and Arnold), and, of course, Eve knows what the public color is because you announce it publicly.
步骤3:你和阿诺德各自将一罐公共颜色和一罐私人颜色混合,制作出你们的“公共-私人混合色”。
Step 3. You and Arnold each create a mixture by combining one pot of the public color with one pot of your private color. This produces your “public-private mixture.”
显然,阿诺德的公共和私人空间混合颜色会与你的不同,因为他的私人空间颜色与你的不同。如果我们继续上面的例子,那么你的公共和私人空间混合颜色会包含一盆薰衣草色和一盆雏菊黄,而阿诺德的公共和私人空间混合颜色则包含深红色和雏菊黄。
Obviously, Arnold's public-private mixture will be different from yours, because his private color is different from yours. If we stick with the above example, then your public-private mixture will contain one pot each of lavender and daisy-yellow, whereas Arnold's public-private mixture consists of crimson and daisy-yellow.
此时,你和阿诺德想互相赠送你们的公私混合颜料样品,但请记住,直接把混合好的颜料给房间里的其他人是违反规则的。把颜料给别人的唯一方法是自己制作几批,然后放在房间中央,这样任何想要的人都可以拿走。你和阿诺德就是这么做的:你们每个人都制作几批公私混合颜料,然后放在房间中央。如果伊芙愿意,她可以偷一两批,但我们马上就会知道,这对她没有任何好处。下一页的图表展示了颜料混合戏法第三步之后的情况。
At this point, you and Arnold would like to give each other samples of your public-private mixtures, but remember it's against the rules to directly give a mixture of paint to one of the other people in the room. The only way to give a mixture to someone else is to make several batches of it and leave them in the middle of the room so that anyone who wants one can take it. This is exactly what you and Arnold do: each of you makes several batches of your public-private mixture and leaves them in the middle of the room. Eve can steal a batch or two if she wants, but as we will learn in a minute, they will do her no good at all. The figure on the following page shows the situation after this third step of the paint-mixing trick.
好了,现在我们有点进展了。如果你现在仔细思考,你可能会想到最后一个技巧,它能让你和 Arnold 各自创建一个相同的共享秘密组合,而不会让 Eve 知道这个秘密。答案如下:
OK, now we're getting somewhere. If you think hard at this point, you might see the final trick that would allow you and Arnold to each create an identical shared secret mixture without letting Eve in on the secret. Here's the answer:
混合颜料技巧,步骤 3:公私混合物可供任何需要的人使用。
The paint-mixing trick, step 3: The public-private mixtures are available to anyone who wants them.
步骤4:你拿起一批阿诺德的公私混合颜料,带回你的角落。现在,加入一罐你的私人颜料。与此同时,阿诺德也拿起一批你的公私混合颜料,带回他的角落,加入一罐他私人颜料。
Step 4. You pick up a batch of Arnold's public-private mixture and take it back to your corner. Now add one pot of your private color. Meanwhile, Arnold picks up a batch of your public-private mixture and takes it back to his corner, where he adds it to a pot of his private color.
令人惊讶的是,你们俩竟然调出了一模一样的混合色!我们来验证一下:你把你的私人颜色(淡紫色)加到了阿诺德的公共-私人混合色(深红色和雏菊黄)中,最终混合色是1色淡紫色、1色深红色和1色雏菊黄。那阿诺德的最终混合色呢?他把他的私人颜色(深红色)加到了你的公共-私人混合色(淡紫色和雏菊黄)中,最终混合色是1色深红色、1色淡紫色和1色雏菊黄。这和你的最终混合色一模一样。这真的是一个共享的秘密混合色。下一页的图展示了颜料混合技巧最后一步之后的情况。
Amazingly, you have both just created identical mixtures! Let's check: you added your private color (lavender) to Arnold's public-private mixture (crimson and daisy-yellow), resulting in a final mixture of 1 lavender, 1 crimson, 1 daisy-yellow. What about Arnold's final mixture? Well, he added his private color (crimson) to your public-private mixture (lavender and daisy-yellow), resulting in a final mixture of 1 crimson, 1 lavender, 1 daisy-yellow. This is exactly the same as your final mixture. It really is a shared secret mixture. The figure on the next page shows the situation after this final step of the paint-mixing trick.
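The bookkeeping in this check can be sketched by treating each mixture as a count of pots per color. The variable names below are illustrative; the colors are the ones from the example.

```python
from collections import Counter

public_color = Counter({"daisy-yellow": 1})
your_private = Counter({"lavender": 1})
arnold_private = Counter({"crimson": 1})

# Step 3: each of you mixes one pot of the public color with your private color.
your_public_private = public_color + your_private
arnold_public_private = public_color + arnold_private

# Step 4: each of you adds your own private color to the other's mixture.
your_final = arnold_public_private + your_private
arnold_final = your_public_private + arnold_private

print(dict(your_final))
print(your_final == arnold_final)  # True
```

The two final mixtures agree because adding pots is order-independent: it makes no difference whether lavender or crimson goes in first.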
那么,Eve 怎么办?为什么她不能批量制作这种共享的秘密混合色?原因是她不知道你的私人颜色和 Arnold 的私人颜色,而她至少需要其中一种才能创建共享的秘密混合色。你和 Arnold 挫败了她的计划,因为你们从未将各自的私人颜色单独暴露在房间中央。相反,你们各自将自己的私人颜色与公开颜色混合后再公开,Eve 无法“分解”这些公开-私人混合色,从而获得其中一种私人颜色的纯净样本。
Now, what about Eve? Why can't she create a batch of this shared secret mixture? The reason is that she doesn't know your private color or Arnold's private color, and she needs at least one of them to create the shared secret mixture. You and Arnold have thwarted her, because you never left your private colors exposed, on their own, in the middle of the room. Instead, you each combined your private color with the public color before exposing it, and Eve has no way of “unmixing” the public-private mixtures to obtain a pure sample of one of the private colors.
混合颜料技巧,步骤 4:只有您和 Arnold 可以通过组合箭头所示的混合物来制作共享的秘密颜色。
The paint-mixing trick, step 4: Only you and Arnold can make the shared secret color, by combining the mixtures shown by arrows.
因此,Eve只能访问两种公私混合色。如果她将你的一批公私混合色与 Arnold 的一批公私混合色混合,结果将包含 1 色深红色、1 色淡紫色和 2 色雏菊黄。换句话说,与共享的秘密混合色相比,Eve 的混合色多了一种雏菊黄。她的混合物太黄了,而且由于没有办法“取消混合”颜料,所以她无法去除多余的黄色。你可能认为 Eve 可以通过添加更多的深红色和淡紫色来解决这个问题,但请记住,她不知道你的私有颜色,所以她不知道这些是需要添加的颜色。她只能添加深红色加雏菊黄或淡紫色加雏菊黄的组合,而这些组合总是会导致她的混合物太黄。
Thus, Eve has access only to the two public-private mixtures. If she mixes one batch of your public-private mixture with one batch of Arnold's public-private mixture, the result will contain 1 crimson, 1 lavender, and 2 daisy-yellow. In other words, compared to the shared secret mixture, Eve's mixture has an extra daisy-yellow. Her mixture is too yellow, and because there's no way to “unmix” paint, she can't remove that extra yellow. You might think Eve could get around this by adding more crimson and lavender, but remember she doesn't know your private colors, so she wouldn't know that these are the colors that need to be added. She can only add the combination of crimson plus daisy-yellow or lavender plus daisy-yellow, and these will always result in her mixture being too yellow.
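The bookkeeping in the paint-mixing trick can be checked with a small sketch. It assumes, purely for illustration, that a mixture can be modeled as a multiset of paint pots; the names `mix`, `your_ppm`, and `arnold_ppm` are hypothetical and not from the text. There is deliberately no "unmix" function, which mirrors the one-way nature of mixing.

```python
from collections import Counter

# Modeling assumption: a mixture is a multiset counting pots of each color.
PUBLIC = Counter({"daisy-yellow": 1})
YOUR_PRIVATE = Counter({"lavender": 1})
ARNOLD_PRIVATE = Counter({"crimson": 1})


def mix(a, b):
    """Combine two mixtures by adding their pot counts."""
    return a + b


# Step 3: each person leaves a public-private mixture in the middle of the room.
your_ppm = mix(PUBLIC, YOUR_PRIVATE)
arnold_ppm = mix(PUBLIC, ARNOLD_PRIVATE)

# Step 4: each person adds a private color to the other's public-private mixture.
your_secret = mix(arnold_ppm, YOUR_PRIVATE)
arnold_secret = mix(your_ppm, ARNOLD_PRIVATE)

# Eve can only combine what was left in the middle of the room.
eve_attempt = mix(your_ppm, arnold_ppm)

print(your_secret == arnold_secret)  # the two final mixtures match
print(eve_attempt - your_secret)     # what Eve has too much of
```

Under this model the two final mixtures are equal, and Eve's attempt differs from the shared secret by exactly one extra pot of daisy-yellow.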
用数字混合颜料
Paint-Mixing with Numbers
如果您理解了调漆技巧,您就理解了计算机如何在互联网上建立共享秘密的本质。但是,当然,它们实际上并不使用颜料。计算机使用数字,而使用数学来混合数字。它们实际使用的数学并不是太复杂,但是一开始就足够复杂以至于令人困惑。因此,为了下一步了解如何在互联网上建立共享秘密,我们将使用一些“假装”数学。真正的要点是,要将调漆技巧转化为数字,我们需要一个单向动作:可以做但不能撤消的事情。在调漆技巧中,单向动作是“混合颜料”。将一些颜料混合在一起形成新颜色很容易,但不可能“取消混合”它们并恢复原来的颜色。这就是为什么调漆是一种单向动作。
If you understand the paint-mixing trick, you understand the essence of how computers establish shared secrets on the internet. But, of course, they don't really use paint. Computers use numbers, and to mix the numbers they use mathematics. The actual math they use isn't too complicated, but it's complicated enough to be confusing at first. So, for our next step toward understanding how shared secrets are established on the internet, we will use some “pretend” math. The real point is that to translate the paint-mixing trick into numbers, we need a one-way action: something that can be done, but can't be undone. In the paint-mixing trick the one-way action was “mixing paint.” It's easy to mix some paints together to form a new color, but it's impossible to “unmix” them and get the original colors back. That's why paint-mixing is a one-way action.
我们之前发现要用一些假装的数学。我们要假装的是:将两个数字相乘是一个单向操作。我相信你已经意识到了,这绝对是假装的。乘法的逆运算是除法,只需执行除法就可以很容易地撤销乘法。例如,如果我们从数字 5 开始,然后将其乘以 7,得到 35。撤销这个乘法很容易,只需从 35 开始,然后除以 7。这样就回到了我们最初的 5。
We found out earlier that we would be using some pretend math. Here is what we are going to pretend: multiplying two numbers together is a one-way action. As I'm sure you realize, this is definitely a pretense. The opposite of multiplication is division, and it's easy to undo a multiplication just by performing a division. For example, if we start with the number 5 and then multiply it by 7, we get 35. It's easy to undo this multiplication by starting with 35 and dividing by 7. That gets us back to the 5 we started with.
尽管如此,我们还是要继续玩你、阿诺德和伊芙之间的另一个游戏。这次,我们假设你们都知道如何做乘法,但没有人知道如何做除法。目标与之前几乎相同:你和阿诺德试图建立一个共享秘密,但这次共享秘密是一个数字,而不是一种颜色。遵循通常的通信规则:所有通信必须公开,这样伊芙才能听到你和阿诺德之间的任何对话。
But despite that, we are going to stick with the pretense and play another game between you, Arnold, and Eve. And this time, we'll assume you all know how to multiply numbers together, but none of you knows how to divide one number by another number. The objective is almost the same as before: you and Arnold are trying to establish a shared secret, but this time the shared secret will be a number rather than a color of paint. The usual communication rules apply: all communication must be public, so Eve can hear any conversations between you and Arnold.
好的,现在我们要做的就是将混合颜料的技巧转化为数字:
OK, now all we have to do is translate the paint-mixing trick into numbers:
步骤 1.您和 Arnold 不选择“私人颜色”,而是各自选择一个“私人号码”。
Step 1. Instead of choosing a “private color,” you and Arnold each choose a “private number.”
假设你选择 4,而阿诺德选择 6。现在回想一下颜料混合戏法的剩余步骤:宣布公开颜色,进行公私混合,公开地将你的公私混合色与阿诺德的公私混合色交换,最后将你的私人颜色添加到阿诺德的公私混合色中,得到共享的秘密颜色。用乘法而不是颜料混合作为单向操作,将其转化为数字应该不难。在继续阅读之前,请花几分钟时间看看你是否能自己算出这个例子。
Let's say you choose 4 and Arnold chooses 6. Now think back to the remaining steps of the paint-mixing trick: announcing the public color, making a public-private mixture, publicly swapping your public-private mixture with Arnold's, and finally adding your private color to Arnold's public-private mixture to get the shared secret color. It shouldn't be too hard to see how to translate this into numbers, using multiplication as the one-way action instead of paint-mixing. Take a couple of minutes to see if you can work out this example for yourself, before reading on.
解决方案并不难;你们都已经选择了各自的私人数字(4 和 6),所以下一步是
The solution isn't too hard to follow; you've already both chosen your private numbers (4 and 6), so the next step is
第 2 步。你们中的一个人宣布一个“公共号码”(而不是混合颜料技巧中的公共颜色)。
Step 2. One of you announces a “public number” (instead of the public color in the paint-mixing trick).
假设你选择7作为公共数字。
Let's say you choose 7 as the public number.
颜料混合技巧的下一步是创造一种公私混合。但我们已经决定,我们不是混合颜料,而是乘以数字。所以你只需要
The next step in the paint-mixing trick was to create a public-private mixture. But we already decided that instead of mixing paints we would be multiplying numbers. So all you have to do is
步骤3.将您的私人数字(4)乘以公共数字(7),得到您的“公私数字”28。
Step 3. Multiply your private number (4) and the public number (7) to get your “public-private number,” 28.
你可以公开宣布这一点,这样阿诺德和伊芙都知道你的公私数字是28(以后再也不用提着颜料罐到处跑了)。阿诺德也对他的私人数字做了同样的事情:他用这个数字乘以公共数字,然后公布他的公私数字,即6 × 7,也就是42。下一页的图显示了此时的情况。
You can announce this publicly so that Arnold and Eve both know your public-private number is 28 (there's no need to carry pots of paint around anymore). Arnold does the same thing with his private number: he multiplies it by the public number, and announces his public-private number, which is 6 × 7, or 42. The figure on the following page shows the situation at this point in the process.
还记得颜料混合技巧的最后一步吗?你取了 Arnold 的公私混合色,然后加入一罐你自己的颜料,就得到了共享的秘密颜色。这里也发生了同样的事情,只不过用的是乘法而不是颜料混合:
Remember the last step of the paint-mixing trick? You took Arnold's public-private mixture, and added a pot of your private color to produce the shared secret color. Exactly the same thing happens here, using multiplication instead of paint-mixing:
步骤 4.取 Arnold 的公私数字 42,乘以您的私人数字 4,得出共享秘密数字168。
Step 4. You take Arnold's public-private number, which is 42, and multiply it by your private number, 4, which results in the shared secret number, 168.
与此同时,阿诺德将你的公私数字 28 乘以他的私人数字 6,令人惊奇的是,他得到了相同的共享秘密数字,因为 28 × 6 = 168。最终结果如对面页的图所示。
Meanwhile, Arnold takes your public-private number, 28, and multiplies it by his private number, 6, and—amazingly—gets the same shared secret number, since 28 × 6 = 168. The final result is shown in the figure on the facing page.
号码混合技巧,步骤 3:公私号码可供任何需要的人使用。
The number-mixing trick, step 3: The public-private numbers are available to anyone who wants them.
实际上,如果你以正确的方式思考,这并不奇怪。当 Arnold 和你设法同时产生相同的共享秘密颜色时,这是因为你们混合了相同的三种原始颜色,但顺序不同:你们各自保密一种颜色,并将其与公开的另外两种颜色混合。同样的事情也发生在数字上。你们通过将相同的三个数字相乘得出了相同的共享秘密:4、6 和 7。(是的,你可以自己检查,4 × 6 × 7 = 168。)但是你通过将 4 保密并将其与 Arnold 宣布的公开的 6 和 7 的混合色(即 42)“混合”(即相乘)得出了共享秘密。另一方面,Arnold通过将 6 保密并将其与你宣布的公开的 4 和 7 的混合色(即 28)混合得出了共享秘密。
Actually, when you think about it the right way, this isn't amazing at all. When Arnold and you managed to both produce the same shared secret color, it was because you mixed together the same three original colors, but in a different order: each of you kept one of the colors private, combining it with a publicly available mixture of the other two. The same thing has happened here with numbers. You both arrived at the same shared secret by multiplying together the same three numbers: 4, 6, and 7. (Yes, as you can check for yourself, 4 × 6 × 7 = 168.) But you arrived at the shared secret by keeping 4 private and “mixing” (i.e., multiplying) it with the publicly available mixture of 6 and 7 (i.e., 42) that had been announced by Arnold. On the other hand, Arnold arrived at the shared secret by keeping 6 private and mixing it with the publicly available mixture of 4 and 7 (i.e., 28) that you had announced.
就像我们在颜料混合戏法中所做的那样,现在让我们验证一下伊芙确实无法算出共享秘密。伊芙会听到每个公私数字的数值。所以她听到你说“28”,阿诺德说“42”。而且她也知道公共数字是7。所以,如果伊芙会除法,她就能立刻算出你们所有的秘密,只需注意到28÷7=4和42÷7=6即可。然后她可以通过计算4×6×7=168得到共享秘密。然而,幸运的是,我们在这个游戏中使用了假装数学:我们假设乘法是单向操作,因此伊芙不会除法。所以她手里只有28、42和7这几个数字。她可以把其中一些数字相乘,但这并不能让她知道任何关于共享秘密的信息。例如,如果她计算出 28 × 42 = 1176,那就大错特错了。就像在调颜料游戏中她的结果太黄一样,这里她的结果中 7 太多了。共享秘密中只有一个因数 7,因为 168 = 4 × 6 × 7。但伊芙破解秘密的尝试却有两个因数 7,因为 1176 = 4 × 6 × 7 × 7。而她根本无法去掉那个多余的 7,因为她不会做除法。
Just as we did in the paint-mixing trick, let's now verify that Eve has no chance of working out the shared secret. Eve gets to hear the value of each public-private number as it is announced. So she hears you say “28,” and Arnold say “42.” And she also knows the public number, which is 7. So if Eve knew how to do division, she could work out all your secrets immediately, just by observing that 28 ÷ 7 = 4, and 42 ÷ 7 = 6. And she could go on to compute the shared secret by calculating 4 × 6 × 7 = 168. However, luckily, we are using pretend math in this game: we assumed that multiplication was a one-way action and therefore Eve doesn't know how to divide. So she is stuck with the numbers 28, 42, and 7. She can multiply some of them together, but that doesn't tell her anything about the shared secret. For example, if she takes 28 × 42 = 1176, she is way off. Just as in the paint-mixing game her result was too yellow, here her result has too many 7's. The shared secret has only one factor of 7 in it, since 168 = 4 × 6 × 7. But Eve's attempt at cracking the secret has two factors of 7 in it, since 1176 = 4 × 6 × 7 × 7. And there's no way she can get rid of that extra 7, since she doesn't know how to do division.
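The pretend-math version of the trick is easy to replay in a few lines. This is only a sketch using the numbers from the text (private numbers 4 and 6, public number 7); the variable names are hypothetical.

```python
PUBLIC_NUMBER = 7
your_private, arnold_private = 4, 6

# Step 3: each person announces private number x public number.
your_ppn = your_private * PUBLIC_NUMBER      # 28
arnold_ppn = arnold_private * PUBLIC_NUMBER  # 42

# Step 4: multiply the other person's public-private number by your own private number.
your_secret = arnold_ppn * your_private
arnold_secret = your_ppn * arnold_private

# Eve cannot divide in this game, so the best she can do is multiply what she heard.
eve_attempt = your_ppn * arnold_ppn

print(your_secret, arnold_secret)  # both are 168
print(eve_attempt)                 # 1176, which has one factor of 7 too many
```

Both of you arrive at 168, while Eve's product equals 168 × 7 and cannot be reduced without division.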
数字混合技巧,步骤 4:只有您和 Arnold 可以通过将箭头所示的数字相乘来生成共享的秘密数字。
The number-mixing trick, step 4: Only you and Arnold can make the shared secret number, by multiplying together the numbers shown by arrows.
现实生活中的颜料混合
Paint-Mixing in Real Life
我们已经涵盖了计算机在互联网上建立共享秘密所需的所有基本概念。“用数字混合颜料”方案的唯一缺陷在于它使用了“假装数学”,即我们假装任何一方都不会做除法。为了完成这个方案,我们需要一个现实生活中的数学运算,它很容易做到(比如混合颜料),但实际上不可能撤销(比如把混合的颜料分开)。在现实生活中,计算机进行这样的操作时,混合操作称为离散指数运算,撤销混合的操作称为离散对数运算。由于目前还没有已知的方法可以让计算机高效地计算离散对数,因此离散指数运算正是我们正在寻找的那种单向操作。为了正确解释离散指数运算,我们需要两个简单的数学概念。我们还需要写出一些公式。如果您不喜欢公式,请跳过本节的其余部分;您已经了解了几乎所有关于这个主题的知识。另一方面,如果您真的想知道计算机是如何实现这种魔法的,请继续阅读。
We have now covered all of the fundamental concepts needed for computers to establish shared secrets on the internet. The only flaw in the paint-mixing-with-numbers scheme is that it uses “pretend math,” in which we pretended that none of the parties could do division. To complete the recipe, we need a real-life math operation that is easy to do (like mixing paint) but practically impossible to undo (like unmixing paint). When computers do this in real life, the mixing operation is a thing called discrete exponentiation and the unmixing operation is called the discrete logarithm. And because there is no known method for a computer to calculate discrete logarithms efficiently, discrete exponentiation turns out to be just the kind of one-way action we are looking for. To explain discrete exponentiation properly, we're going to need two simple mathematical ideas. And we'll also need to write a few formulas. If you don't like formulas, just skip the rest of this section; you already understand almost everything about this topic. On the other hand, if you really want to know how computers do this magic, read on.
我们需要的第一个重要的数学概念叫做“钟表算术”。这其实是我们都很熟悉的:钟面上只有12个数字,所以每当时针超过12时,它就从1开始重新计数。一项活动从10点开始,持续4个小时,在2点结束,所以在这个12小时制的钟表中,我们可以说10 + 4 = 2。在数学中,钟表算术的工作原理与之相同,但有两个细节不同:(i) 钟面的大小可以是任意数字(而不是我们熟悉的普通钟面上的12个数字);(ii) 数字从0开始计数,而不是从1开始。
The first important math idea we need is called clock arithmetic. This is actually something we are all familiar with: there are only 12 numbers on a clock, so every time the hour hand goes past 12, it starts counting again from 1. An activity that starts at 10 o'clock and lasts 4 hours finishes at 2 o'clock, so we might say that 10 + 4 = 2 in this 12-hour clock system. In mathematics, clock arithmetic works the same way, except for two details: (i) the size of the clock can be any number (rather than the familiar 12 numbers on a regular clock), and (ii) the numbers start counting from 0 rather than 1.
下一页的图以时钟大小 7 为例。注意,钟面上的数字分别为 0、1、2、3、4、5 和 6。要用时钟大小 7 进行时钟算术,只需像平常一样进行数字的加法和乘法即可,但每当得出一个答案时,只保留除以 7 后的余数。因此,要计算 12 + 6,我们首先像平常一样进行加法运算,得到 18。然后我们注意到 18 中包含两个 7(即 14),余下 4。所以最终答案是
The figure on the next page gives an example using the clock size 7. Note that the numbers on the clock are 0, 1, 2, 3, 4, 5, and 6. To do clock arithmetic with clock size 7, just add and multiply numbers together as normal—but whenever an answer is produced, you only count the remainder after dividing by 7. So to compute 12 + 6, we first do the addition as normal, obtaining 18. Then we notice that 7 goes into 18 twice (making 14), with 4 left over. So the final answer is
12 + 6 = 4(时钟尺寸 7)
12 + 6 = 4 (clock size 7)
在下面的例子中,我们将使用 11 作为时钟大小。(正如后面所讨论的,实际实现中的时钟大小会大得多。我们使用较小的时钟大小是为了尽可能简化解释。)除以 11 后取余数并不难,因为 11 的倍数都有重复的数字,比如 66 和 88。以下是一些使用时钟大小 11 进行计算的示例:
In the examples below, we'll be using 11 as the clock size. (As discussed later, the clock size in a real implementation would be much, much larger. We are using a small clock size to keep the explanation as simple as possible.) Taking the remainder after dividing by 11 isn't too hard, since the multiples of 11 all have repeated digits like 66 and 88. Here are a few examples of calculations with a clock size of 11:
7 + 9 + 8 = 24 = 2(时钟大小 11)
8 × 7 = 56 = 1(时钟大小 11)
7 + 9 + 8 = 24 = 2 (clock size 11)
8 × 7 = 56 = 1 (clock size 11)
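The clock-arithmetic examples above can be reproduced with the remainder operator. This is a minimal sketch; the helper name `clock` is hypothetical and simply wraps Python's `%` operator.

```python
def clock(value, size):
    """Reduce an ordinary number to its position on a clock of the given size."""
    return value % size


# Clock size 7: the numbers on the clock are 0 through 6.
print(clock(12, 7))         # 12 simplifies to 5
print(clock(12 + 6, 7))     # 12 + 6 = 18, which leaves remainder 4

# Clock size 11, matching the two examples in the text.
print(clock(7 + 9 + 8, 11)) # 24 leaves remainder 2
print(clock(8 * 7, 11))     # 56 leaves remainder 1
```

Note that the ordinary addition or multiplication is done first, and only the final answer is reduced to a remainder.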
我们需要的第二个数学概念是幂表示法。这没什么特别的:它只是一种快速写下多个相同数字相乘的方法。与其写成 6 × 6 × 6 × 6(也就是 6 连续 4 次自乘),不如写成 6^4。而且,你可以将幂表示法与时钟算术结合起来。例如,
The second math idea we need is power notation. This is nothing fancy: it's just a quick way of writing down lots of multiplications of the same number. Instead of writing 6 × 6 × 6 × 6, which is just 6 multiplied by itself 4 times in a row, you can write 6^4. And you can combine power notation with clock arithmetic. For example,
左图:当使用 7 号时钟时,数字 12 被简化为数字 5——只需从零开始,按箭头所示顺时针方向数 12 个单位即可。右图:同样使用 7 号时钟,我们发现 12 + 6 = 4——从左图中结束的 5 开始,顺时针方向再加 6 个单位。
Left: When using a clock size of 7, the number 12 is simplified to the number 5—just start at zero and count 12 units in a clockwise direction, as shown by the arrow. Right: Again using a clock size of 7, we find that 12 + 6 = 4—starting at 5, where we ended in the left figure, add on another 6 units in clockwise direction.
下一页的表格显示了使用时钟大小 11 时 2、3 和 6 的前十个幂。这些将在我们即将进行的示例中用到。因此,在继续之前,请确保您清楚此表的生成方式。让我们看一下最后一列。此列中的第一个条目是 6,与 6^1 相同。下一个条目表示 6^2,即 36,但由于我们使用的是时钟大小 11,而 36 比 33 大 3,因此表中的条目为 3。要计算此列中的第三个条目,您可能认为我们需要计算 6^3 = 6 × 6 × 6,但有一种更简单的方法。我们已经针对所关心的时钟大小计算过 6^2,结果是 3。要得到 6^3,我们只需将之前的结果乘以 6 即可。这样就得到了 3 × 6 = 18 = 7(时钟大小 11)。下一个条目是 7 × 6 = 42 = 9(时钟大小 11),依此类推。
The table on the following page shows the first ten powers of 2, 3, and 6 when using clock size 11. These will be useful in the example we're about to work through. So before plunging on, make sure you're comfortable with how this table was generated. Let's take a look at the last column. The first entry in this column is 6, which is the same thing as 6^1. The next entry represents 6^2, or 36, but since we're using clock size 11 and 36 is 3 more than 33, the entry in the table is a 3. To calculate the third entry in this column, you might think that we need to work out 6^3 = 6 × 6 × 6, but there is an easier way. We have already computed 6^2 for the clock size we're interested in; it turned out to be 3. To get 6^3, we just need to multiply the previous result by 6. This gives 3 × 6 = 18 = 7 (clock size 11). And the next entry is 7 × 6 = 42 = 9 (clock size 11), and so on down the column.
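The "multiply the previous entry by the base, then take the remainder" shortcut described above can be sketched as follows. The function name `powers` is hypothetical; the output is a regenerated version of the table of powers, not a copy of the printed one.

```python
CLOCK_SIZE = 11


def powers(base, count=10, clock_size=CLOCK_SIZE):
    """Return base^1 .. base^count in clock arithmetic, each from the entry above it."""
    entries = []
    current = 1
    for _ in range(count):
        # Multiply the previous result by the base and keep only the remainder.
        current = (current * base) % clock_size
        entries.append(current)
    return entries


columns = {base: powers(base) for base in (2, 3, 6)}

print(" n   2^n  3^n  6^n   (clock size 11)")
for n in range(1, 11):
    row = [columns[base][n - 1] for base in (2, 3, 6)]
    print(f"{n:2d}   {row[0]:3d}  {row[1]:3d}  {row[2]:3d}")
```

The last column starts 6, 3, 7, 9, which agrees with the entries worked out in the paragraph above.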
好的,我们终于可以创建一个共享秘密了,就像现实生活中的计算机一样。像往常一样,你和 Arnold 会尝试共享一个秘密,而 Eve 则会窃听并试图找出这个秘密。
OK, we are finally ready to establish a shared secret, as used by computers in real life. As usual, you and Arnold will be trying to share a secret, while Eve eavesdrops and tries to work out what the secret is.
步骤 1.您和 Arnold 各自选择一个私人号码。
Step 1. You and Arnold each separately choose a private number.
该表显示了使用时钟大小 11 时 2、3 和 6 的前十个幂。如文中所述,每个条目都可以通过一些非常简单的算术从其上方的条目计算出来。
This table shows the first ten powers of 2, 3, and 6 when using clock size 11. As explained in the text, each entry can be computed from the one above it by some very simple arithmetic.
为了尽可能简化数学计算,我们将在本例中使用非常小的数字。假设你选择 8 作为你的私人数字,而 Arnold 选择 9。这两个数字——8 和 9——本身并不是共享秘密,但就像你在颜料混合技巧中选择的私人颜色一样,这些数字将被用作“混合”共享秘密的成分。
To keep the math as easy as possible, we'll use very small numbers in this example. So suppose you choose 8 as your private number, and Arnold chooses 9. These two numbers—8 and 9—are not themselves shared secrets, but just like the private colors you chose in the paint-mixing trick, these numbers will be used as ingredients to “mix up” a shared secret.
第 2 步。您和 Arnold 公开同意两个公共数字:时钟尺寸(在此示例中我们将使用 11)和另一个数字,称为基数(我们将使用基数 2)。
Step 2. You and Arnold publicly agree on two public numbers: a clock size (we'll use 11 in this example) and another number, called the base (we'll use the base 2).
这两个公开的数字——11 和 2——类似于你和 Arnold 在调漆魔术开始时约定的公开颜色。请注意,调漆的类比在这里确实有点不成立:我们只需要一种公开颜色,而需要两个公开的数字。
These public numbers—11 and 2—are analogous to the public color that you and Arnold agreed on at the start of the paint-mixing trick. Note that the paint-mixing analogy does break down a little here: whereas we needed only one public color, two public numbers are needed.
步骤 3.您和 Arnold 分别通过将您的私人号码与公共号码混合,使用幂符号和时钟算术来创建一个公私号码(PPN)。
Step 3. You and Arnold each separately create a public-private number (PPN) by mixing up your private number with the public numbers, using power notation and clock arithmetic.
具体来说,混合按照配方进行
Specifically, the mixing is done according to the formula
PPN = 基数^(私人数字)(时钟大小)
PPN = base^(private number) (clock size)
这个公式用文字表达出来可能有点奇怪,但实际操作起来很简单。在我们的例子中,我们可以通过查阅上一页的表格来计算答案:
This formula might look a little weird written out in words, but it's simple in practice. In our example, we can work out the answers by consulting the table on the previous page:
您的 PPN = 2^8 = 3(时钟大小 11)
Arnold 的 PPN = 2^9 = 6(时钟大小 11)
your PPN = 2^8 = 3 (clock size 11)
Arnold's PPN = 2^9 = 6 (clock size 11)
您可以在下一页的图中看到此步骤之后的情况。这些公私数字与你在颜料混合技巧第三步中制作的“公私混合色”完全类似。在那里,你将一罐公用颜色与一部分你的私人颜色混合,制成了你的公私混合色。在这里,你使用幂符号和时钟算法将你的私人数字与公用数字混合。
You can see the situation after this step in the figure on the following page. These public-private numbers are precisely analogous to the “public-private mixtures” that you made in the third step of the paint-mixing trick. There, you mixed one pot of the public color with one part of your private color to make your public-private mixture. Here, you have mixed your private number with the public numbers using power notation and clock arithmetic.
步骤 4.您和 Arnold 分别获取对方的公私号码,并将其与自己的私人号码混合。
Step 4. You and Arnold each separately take the other's public-private number and mix it in with your own private number.
这是根据公式完成的
This is done according to the formula
共享秘密 = (对方的 PPN)^(私人数字)(时钟大小)
shared secret = (other person's PPN)^(private number) (clock size)
再次,用文字写出来看起来有点奇怪,但通过查阅上一页的表格,用数字就可以简单地算出来:
Again this looks a little weird written out in words, but by consulting the table on the previous page, it works out simply in numbers:
你的共享秘密 = 6^8 = 4(时钟大小 11)
Arnold 的共享秘密 = 3^9 = 4(时钟大小 11)
your shared secret = 6^8 = 4 (clock size 11)
Arnold's shared secret = 3^9 = 4 (clock size 11)
最终情况如第57页的图所示。
The final situation is shown in the figure on page 57.
当然,你的共享秘密和 Arnold 的共享秘密最终会是同一个数字(在本例中为 4)。这需要一些复杂的数学知识,但基本思路与之前相同:尽管你们混合材料的顺序不同,但你和 Arnold 使用的都是相同的材料,因此产生了相同的共享秘密。
Naturally, your shared secret and Arnold's shared secret end up being the same number (in this case, 4). It depends on some sophisticated mathematics in this case, but the basic idea is the same as before: although you mixed your ingredients in a different order, both you and Arnold used the same ingredients and therefore produced the same shared secret.
现实生活中的数字混合,步骤 3:通过幂和时钟算术计算出的公私数字(3 和 6)可供任何想要的人使用。3 下方显示的“2^8”提醒我们 3 是如何计算的,但在时钟大小 11 下 3 = 2^8 这一事实并未公开。同样,6 下方显示的“2^9”仍然是私有的。
Real-life number-mixing, step 3: The public-private numbers (3 and 6), computed using powers and clock arithmetic, are available to anyone who wants them. The “2^8” shown below the 3 reminds us how the 3 was computed, but the fact that 3 = 2^8 in clock size 11 is not made public. Similarly, the “2^9” shown below the 6 remains private.
和这个魔术的早期版本一样,伊芙被冷落了。她知道两个公开的数字(2 和 11),也知道两个公私结合的数字(3 和 6)。但她无法利用这些知识来计算共享的秘密数字,因为她无法获取你和阿诺德掌握的任何秘密成分(即私人数字)。
And as with the earlier versions of this trick, Eve is left out in the cold. She knows the two public numbers (2 and 11), and she also knows the two public-private numbers (3 and 6). But she can't use any of this knowledge to compute the shared secret number, because she can't access either of the secret ingredients (the private numbers) held by you and Arnold.
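The four steps of the real-life number-mixing trick can be replayed with Python's built-in three-argument `pow`, which computes a power in clock arithmetic directly. This is a toy sketch with the tiny numbers from the text, not a secure implementation; the function names `make_ppn` and `make_shared_secret` are hypothetical.

```python
# Step 2: the two public numbers.
CLOCK_SIZE = 11
BASE = 2


def make_ppn(private_number):
    """Step 3: PPN = base^(private number) in clock arithmetic."""
    return pow(BASE, private_number, CLOCK_SIZE)


def make_shared_secret(other_ppn, private_number):
    """Step 4: shared secret = (other person's PPN)^(private number) in clock arithmetic."""
    return pow(other_ppn, private_number, CLOCK_SIZE)


# Step 1: the private numbers chosen in the text.
your_private, arnold_private = 8, 9

your_ppn = make_ppn(your_private)      # 2^8 = 3 (clock size 11)
arnold_ppn = make_ppn(arnold_private)  # 2^9 = 6 (clock size 11)

your_secret = make_shared_secret(arnold_ppn, your_private)    # 6^8
arnold_secret = make_shared_secret(your_ppn, arnold_private)  # 3^9

print(your_ppn, arnold_ppn)        # the two public-private numbers
print(your_secret, arnold_secret)  # both sides end up with the same number
```

The two results agree because (2^9)^8 and (2^8)^9 are both 2^72, so each side has combined the same ingredients in a different order.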
公钥密码学实践
PUBLIC KEY CRYPTOGRAPHY IN PRACTICE
颜料混合技巧的最终版本,即使用幂和时钟算法混合数字,是计算机在互联网上实际建立共享秘密的方法之一。这里描述的具体方法称为迪菲-赫尔曼密钥交换协议,以惠特菲尔德·迪菲和马丁·赫尔曼的名字命名,他们于 1976 年首次发表了该算法。每当您访问一个安全网站(以“https:”而不是“http:”开头的网站)时,您自己的计算机及其与之通信的网络服务器都会使用迪菲-赫尔曼协议或几种类似工作的替代协议之一创建一个共享秘密。一旦这个共享秘密建立,两台计算机就可以使用前面描述的加法技巧的变体来加密它们之间的所有通信。
The final version of the paint-mixing trick, mixing numbers using powers and clock arithmetic, is one of the ways that computers actually establish shared secrets on the internet. The particular method described here is called the Diffie-Hellman key exchange protocol, named for Whitfield Diffie and Martin Hellman, who first published the algorithm in 1976. Whenever you go to a secure website (one that starts with “https:” rather than “http:”), your own computer and the web server it's communicating with create a shared secret, using the Diffie-Hellman protocol or one of several alternatives that work in a similar way. And once this shared secret is established, the two computers can encrypt all their communication using a variant of the addition trick described earlier.
现实生活中的数字混合,步骤 4:只有您和 Arnold 可以通过将箭头所示的元素组合在一起,使用幂和时钟算术来制作共享的秘密数字。
Real-life number-mixing, step 4: Only you and Arnold can make the shared secret number, by combining together the elements shown with arrows, using powers and clock arithmetic.
重要的是要意识到,在实际使用Diffie-Hellman协议时,实际涉及的数字远大于我们在此讨论的示例。我们使用了一个非常小的时钟大小(11),以便计算简单。但是,如果您选择一个较小的公共时钟大小,那么可能的私有数字的数量也会很少(因为您只能使用小于时钟大小的私有数字)。这意味着有人可以使用计算机尝试所有可能的私有数字,直到找到一个可以生成您的公私数字的私有数字。在上面的例子中,可能的私有数字只有11个,因此这个系统极易被破解。相比之下,Diffie-Hellman协议的实际实现通常使用几百位长的时钟大小,这将产生难以想象的大量可能的私有数字(远远超过一万亿亿)。即便如此,选择公共数字也必须谨慎,以确保它们具有正确的数学性质——如果您对此感兴趣,请查看下一页的方框。
It's important to realize that when the Diffie-Hellman protocol is used in practice, the actual numbers involved are far larger than the examples we worked through here. We used a very small clock size (11), so that the calculations would be easy. But if you choose a small public clock size, then the number of possible private numbers is also small (since you can only use private numbers that are smaller than the clock size). And that means someone could use a computer to try out all the possible private numbers until they find one that produces your public-private number. In the example above, there were only 11 possible private numbers, so this system would be ludicrously easy to crack. In contrast, real implementations of the Diffie-Hellman protocol typically use a clock size that is a few hundred digits long, which creates an unimaginably large number of possible private numbers (much more than a trillion trillion). And even then, the public numbers must be chosen with some care, to make sure they have the correct mathematical properties; check out the box on the next page if you're interested in this.
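To see why a small clock size is so easy to crack, here is a sketch of the "try every possible private number" attack on the toy example (base 2, clock size 11). The function name `crack_private_number` is hypothetical, and the approach is only feasible because the clock size is tiny.

```python
def crack_private_number(ppn, base, clock_size):
    """Try every private number smaller than the clock size until one produces the PPN."""
    for candidate in range(clock_size):
        if pow(base, candidate, clock_size) == ppn:
            return candidate
    return None  # no private number produces this PPN


BASE, CLOCK_SIZE = 2, 11

# Eve overhears the public-private numbers 3 and 6 from the example.
your_private_guess = crack_private_number(3, BASE, CLOCK_SIZE)
arnold_private_guess = crack_private_number(6, BASE, CLOCK_SIZE)
print(your_private_guess, arnold_private_guess)

# With a recovered private number, Eve can compute the shared secret herself.
eve_secret = pow(6, your_private_guess, CLOCK_SIZE)
print(eve_secret)
```

With a clock size a few hundred digits long, the same loop would have to run through an unimaginably large number of candidates, which is what makes the search impractical.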
Diffie-Hellman 公有数最重要的性质是时钟大小必须是素数——因此它除了 1 和它本身之外没有其他因数。另一个有趣的要求是基数必须是时钟大小的原根。这意味着基数的幂最终会循环遍历时钟上的所有可能值。如果你查看第 54 页的表格,你会注意到 2 和 6 都是 11 的原根,但 3 不是——3 的幂循环遍历值 3、9、5、4、1,而忽略了 2、6、7、8 和 10。
The most important property for Diffie-Hellman public numbers is that the clock size must be a prime number, so it has no divisors other than 1 and itself. Another intriguing requirement is that the base must be a primitive root of the clock size. This means that the powers of the base eventually cycle through every possible value on the clock. If you look at the table on page 54, you'll notice that 2 and 6 are both primitive roots of 11, but 3 is not: the powers of 3 cycle through the values 3, 9, 5, 4, 1 and miss 2, 6, 7, 8, and 10.
在为 Diffie-Hellman 协议选择时钟大小和基准时,必须满足某些数学特性。
When choosing a clock size and base for the Diffie-Hellman protocol, certain mathematical properties must be satisfied.
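The primitive-root property described in the box can be checked by brute force for small clock sizes. This sketch assumes the clock size is prime and takes "every possible value" to mean every nonzero value on the clock, since a power of the base never lands on 0 here; the function names are hypothetical.

```python
def values_reached(base, clock_size):
    """Collect every value that the powers of base land on (clock size assumed prime)."""
    seen = set()
    current = 1
    for _ in range(clock_size - 1):
        current = (current * base) % clock_size
        seen.add(current)
    return seen


def is_primitive_root(base, clock_size):
    """True if the powers of base reach every nonzero value on the clock."""
    return values_reached(base, clock_size) == set(range(1, clock_size))


for base in (2, 3, 6):
    print(base, is_primitive_root(base, 11))

# The values that the powers of 3 never reach with clock size 11.
missed_by_3 = sorted(set(range(1, 11)) - values_reached(3, 11))
print(missed_by_3)
```

This simple check is only practical for toy clock sizes; it is meant to illustrate the definition rather than to suggest how real parameters are chosen.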
这里描述的Diffie-Hellman方法只是众多巧妙的明信片(电子)通信技巧之一。计算机科学家将Diffie-Hellman称为密钥交换算法。其他公钥算法的工作原理不同,它们允许你使用收件人公布的公开信息直接加密发送给目标收件人的消息。相比之下,密钥交换算法允许你使用收件人的公开信息建立共享密钥,但加密本身是通过加法技巧完成的。对于大多数互联网通信而言,后一种选择(我们在本章中学习过)是更可取的,因为它所需的计算能力要少得多。
The Diffie-Hellman approach described here is just one of many cunning techniques for communicating via (electronic) postcards. Computer scientists call Diffie-Hellman a key exchange algorithm. Other public key algorithms work differently and allow you to directly encrypt a message for your intended recipient, using public information announced by that recipient. In contrast, a key exchange algorithm allows you to establish a shared secret using the public information from the recipient, but the encryption itself is done via the addition trick. For most communication on the internet, this latter option—the one we have learned about in this chapter—is preferable, as it requires much less computational power.
但有些应用需要完全成熟的公钥加密技术。这些应用中最有趣的可能就是数字签名,我们将在第 9 章中对此进行解释。阅读该章时你会发现,完全成熟的公钥加密技术的思想与我们已经看到的类似:秘密信息以数学上不可逆的方式与公开信息“混合”,就像油漆颜色可以不可逆地混合一样。最著名的公钥密码系统是 RSA,它以首先发布它的三位发明者的名字命名:Ronald Rivest、Adi Shamir 和 Leonard Adleman。第 9 章使用 RSA 作为数字签名工作原理的主要示例。
But there are some applications in which fully fledged public key cryptography is required. Perhaps the most interesting of these applications is digital signatures, which will be explained in chapter 9. As you will discover when you read that chapter, the flavor of the ideas in the fully fledged type of public key cryptography is similar to what we have already seen: secret information is “mixed” with public information in a mathematically irreversible way, just as paint colors can be mixed irreversibly. The most famous public key cryptosystem is the one known as RSA, after the three inventors who first published it: Ronald Rivest, Adi Shamir, and Leonard Adleman. Chapter 9 uses RSA as the main example of how digital signatures work.
这些早期公钥算法的发明背后有一个引人入胜且复杂的故事。Diffie 和 Hellman 确实是最早在 1976 年发布 Diffie-Hellman 算法的人。Rivest、Shamir 和 Adleman 确实是最早在 1978 年发布 RSA 算法的人。但这并非故事的全部!后来人们发现,英国政府其实早在几年前就已知晓类似的系统。不幸的是,Diffie-Hellman 和 RSA 的这些前身的发明者,正是在英国政府通信实验室政府通信总部 (GCHQ) 工作的数学家。他们的工作成果被记录在秘密的内部文件中,直到 1997 年才解密。
There is a fascinating and complex story behind the invention of these early public key algorithms. Diffie and Hellman were indeed the first people to publish Diffie-Hellman, in 1976. Rivest, Shamir, and Adleman were indeed the first to publish RSA, in 1978. But that is not the whole story! It was later discovered that the British government had already known of similar systems for several years. Unfortunately for the inventors of these precursors to Diffie-Hellman and RSA, they were mathematicians working in the British government's communications laboratory, GCHQ. The results of their work were recorded in secret internal documents and were not declassified until 1997.
RSA、Diffie-Hellman 和其他公钥密码系统不仅仅是一些巧妙的构想。它们已经发展成为商业技术和互联网标准,对企业和个人都具有重要意义。如果没有公钥密码技术,我们每天进行的绝大多数在线交易都无法安全完成。RSA 的发明者在 20 世纪 70 年代为他们的系统申请了专利,直到 2000 年底才到期。专利到期当晚,旧金山的美国音乐厅举行了一场庆祝派对——或许是为了庆祝公钥密码技术将永存。
RSA, Diffie-Hellman, and other public key cryptosystems are not just ingenious ideas. They have evolved into commercial technologies and internet standards with great importance for businesses and individuals alike. The vast majority of the online transactions we perform every day could not be completed securely without public key cryptography. The RSA inventors patented their system in the 1970s, and their patent did not expire until late 2000. A celebratory party was held at the Great American Music Hall in San Francisco on the night the patent expired—a celebration, perhaps, of the fact that public key cryptography is here to stay.
1. 对于了解计算机数字系统的人来说,我这里指的是十进制数字,而不是二进制数字(位)。对于了解对数的人来说,从位到十进制数字的转换系数 30% 来自于 log_10 2 ≈ 0.3。
1. For those who know about computer number systems, I'm referring here to decimal digits, not binary digits (bits). For those who know about logarithms, the conversion factor of 30% for transforming from bits to decimal digits comes from the fact that log_10 2 ≈ 0.3.
5
5
纠错码:自我修复的错误
Error-Correcting Codes: Mistakes That Fix Themselves
—约翰·洛克,《人类理解论》(1690年)
—JOHN LOCKE, Essay Concerning Human Understanding (1690)
如今,我们早已习惯了随时随地访问计算机。但理查德·汉明(Richard Hamming)却没那么幸运,他曾在20世纪40年代担任贝尔电话公司实验室的研究员:他需要的公司计算机被其他部门占用,只有周末才能用。因此,你可以想象,计算机读取自身数据时出错,导致系统反复崩溃,他当时有多么沮丧。以下是汉明本人对此的看法:
These days, we're used to accessing computers whenever we need them. Richard Hamming, a researcher working at the Bell Telephone Company labs in the 1940s, was not so lucky: the company computer he needed was used by other departments and available to him only on weekends. You can imagine his frustration, therefore, at the crashes that kept recurring due to the computer's errors in reading its own data. Here is what Hamming himself had to say on the matter:
连续两个周末,我回来后发现我所有的东西都被扔掉了,什么都没做。我真的很生气,也很恼火,因为我想要那些答案,却浪费了两个周末的时间。于是我心想:“该死,如果机器能检测到错误,为什么它不能找到错误的位置并纠正它呢?”
Two weekends in a row I came in and found that all my stuff had been dumped and nothing was done. I was really aroused and annoyed because I wanted those answers and two weekends had been lost. And so I said, “Dammit, if the machine can detect an error, why can't it locate the position of the error and correct it?”
很少有比这更清晰的例子来证明“需要是发明之母”。汉明很快就发明了第一个纠错码:一种看似神奇的算法,可以检测并纠正计算机数据中的错误。如果没有这些代码,我们的计算机和通信系统将比现在慢得多,性能也更差,可靠性也更低。
There can be few more clear-cut cases of necessity being the mother of invention. Hamming had soon created the first ever error-correcting code: a seemingly magical algorithm that detects and corrects errors in computer data. Without these codes, our computers and communication systems would be drastically slower, less powerful, and less reliable than they are today.
错误检测和纠正的必要性
THE NEED FOR ERROR DETECTION AND CORRECTION
计算机有三项基本功能。最重要的功能是执行计算。也就是说,给定一些输入数据,计算机必须以某种方式转换数据才能得出有用的答案。但是,如果没有计算机执行的另外两项非常重要的功能:存储数据和传输数据,计算答案的能力基本上毫无意义。(计算机主要将数据存储在内存和磁盘驱动器中。它们通常通过互联网传输数据。)为了强调这一点,想象一下一台既不能存储也不能传输信息的计算机。当然,它几乎毫无用处。没错,你可以进行一些复杂的计算(例如,准备一份详细的公司预算的复杂财务电子表格),但你将无法将结果发送给同事,甚至无法保存结果以便以后再处理。因此,数据传输和存储对于现代计算机来说至关重要。
Computers have three fundamental jobs. The most important job is to perform computations. That is, given some input data, the computer must transform the data in some way to produce a useful answer. But the ability to compute answers would be essentially useless without the other two very important jobs that computers perform: storing data and transmitting data. (Computers mostly store data in their memory and on disk drives. And they typically transmit data over the internet.) To emphasize this point, imagine a computer that could neither store nor transmit information. It would, of course, be almost useless. Yes, you could do some complex computations (for example, preparing an intricate financial spreadsheet detailing the budget for a company), but then you would be unable to send the results to a colleague or even to save the results so you could come back and work on them later. Therefore, transmission and storage of data are truly essential for modern computers.
But there is a huge challenge associated with transmitting and storing data: the data must be exactly right—because in many cases, even one tiny mistake can render the data useless. As humans, we are also familiar with the need to store and transmit information without any errors. For example, if you write down someone's phone number, it is essential that every digit is recorded correctly and in the right order. If there is even one mistake in one of the digits, the phone number is probably useless to you or anyone else. And in some cases, errors in data can actually be worse than useless. For example, an error in the file that stores a computer program can make that program crash or do things it was not intended to. (It might even delete some important files or crash before you get a chance to save your work.) And an error in some computerized financial records could result in actual monetary loss (if, say, you thought you were buying a stock priced at $5.34 per share but the actual cost was $8.34).
But, as humans, the amount of error-free information we need to store is relatively small, and it's not too hard to avoid mistakes just by checking carefully whenever you know some information is important—things like bank account numbers, passwords, e-mail addresses, and the like. On the other hand, the amount of information that computers need to store and transmit without making any errors is absolutely immense. To get some idea of the scale, consider this. Suppose you have some kind of computing device with a storage capacity of 100 gigabytes. (At the time of writing, this is the typical capacity of a low-cost laptop.) This 100 gigabytes is equivalent to about 15 million pages of text. So even if this computer's storage system makes just one error per million pages, there would still be (on average) 15 mistakes on the device when filled to capacity. And the same lesson applies to transmission of data too: if you download a 20-megabyte software program, and your computer misinterprets just one in every million characters it receives, there will probably still be over 20 errors in your downloaded program—every one of which could cause a potentially costly crash when you least expect it.
The moral of the story is that, for a computer, being accurate 99.9999% of the time is not even close to good enough. Computers must be able to store and transmit literally billions of pieces of information without making a single mistake. But computers have to deal with communication problems just like other devices. Telephones are a good example here: it's obvious that they don't transmit information perfectly, because phone conversations often suffer from distortions, static, or other types of noise. But telephones are not alone in their suffering: electrical wires are subject to all sorts of fluctuations; wireless communications suffer interference all the time; and physical media such as hard disks, CDs, and DVDs can be scratched, damaged, or simply misread because of dust or other physical interference. How on earth can we hope to achieve an error rate of less than one in many billions, in the face of such obvious communication errors? This chapter will reveal the ideas behind the ingenious computer science that makes this magic happen. It turns out that if you use the right tricks, even extremely unreliable communication channels can be used to transmit data with incredibly low error rates—so low that in practice, errors can be completely eliminated.
THE REPETITION TRICK
The most fundamental trick for communicating reliably over an unreliable channel is one that we are all familiar with: to make sure that some information has been communicated correctly, you just need to repeat it a few times. If someone dictates a phone number or bank account number to you over a bad telephone connection, you will probably ask them to repeat it at least once to make sure there were no mistakes.
Computers can do exactly the same thing. Let's suppose a computer at your bank is trying to transmit your account balance to you over the internet. Your account balance is actually $5213.75, but unfortunately the network is not very reliable and every single digit has a 20% chance of being changed to something else. So the first time your balance is transmitted, it might arrive as $5293.75. Obviously, you have no way of knowing whether or not this is correct. All of the digits might be right, but one or more of them might be wrong and you have no way of telling. But by using the repetition trick, you can make a very good guess as to the true balance. Imagine that you ask for your balance to be transmitted five times, and receive the following responses:
Notice that some of the transmissions have more than one digit wrong, and there's even one transmission (number 2) with no errors at all. The crucial point is that you have no way of knowing where the errors are, so there is no way you can pick out transmission 2 as being the correct one. Instead, what you can do is examine each digit separately, looking at all transmissions of that one digit, and pick the value that occurs most often. Here are the results again, with the most common digits listed at the end:
Let's look at some examples to make the idea absolutely clear. Examining the first digit in the transmission, we see that in transmissions 1-4, the first digit was a 5, whereas in transmission 5, the first digit was a 7. In other words, four of the transmissions said “5” and only one said “7.” So although you can't be absolutely sure, the most likely value for the first digit of your bank balance is 5. Moving on to the second digit, we see that 2 occurred four times, and 4 only once, so 2 is the most likely second digit. The third digit is a bit more interesting, because there are three possibilities: 1 occurs three times, 9 occurs once, and 4 occurs once. But the same principle applies, and 1 is the most likely true value. By doing this for all the digits, you can arrive at a final guess for your complete bank balance: $5213.75, which in this case is indeed correct.
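The digit-by-digit vote described above can be sketched in a few lines of Python. The five transmissions below are illustrative values chosen to agree with the description in the text (transmission 2 is error-free, the first digit is 7 only in transmission 5, and so on); they are an assumption, not a copy of the original table. The dollar sign and decimal point are dropped, so the balance appears as "521375", and the function name `majority_vote` is made up for this sketch.

```python
from collections import Counter

# Illustrative transmissions (assumed values that match the description in
# the text); "$5213.75" is written without punctuation as "521375".
received = [
    "529375",
    "521375",
    "521311",
    "544375",
    "721875",
]

def majority_vote(copies):
    """Return the most common symbol at each position across all copies."""
    decoded = []
    for column in zip(*copies):          # one column per digit position
        symbol, _count = Counter(column).most_common(1)[0]
        decoded.append(symbol)
    return "".join(decoded)

print(majority_vote(received))  # 521375
```

Note that the vote never needs to know which transmission was the clean one; it only compares the copies of each digit with one another.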
Well, that was easy. Have we solved the problem already? In some ways, the answer is yes. But you might be a little dissatisfied because of two things. First, the error rate for this communication channel was only 20%, and in some cases computers might need to communicate over channels that are much worse than that. Second, and perhaps more seriously, the final answer happened to be correct in the above example, but there is no guarantee that the answer will always be right: it is just a guess, based on what we think is most likely to be the true bank balance. Luckily, both of these objections can be addressed very easily: we just increase the number of retransmissions until the reliability is as high as we want.
For example, suppose the error rate was 50% instead of the 20% in the last example. Well, you could ask the bank to transmit your balance 1000 times instead of just 5. Let's concentrate on just the first digit, since the others work out the same way. Since the error rate is 50%, about half of the 1000 copies of that digit will be transmitted correctly, as a 5, and the other half will be changed to some other random values. So there will be about 500 occurrences of 5, and only about 50 each of the other digits (0-4 and 6-9). Mathematicians can calculate the chances of one of the other digits coming up more often than the 5: it turns out that even if we transmitted a new bank balance every second using this method, we would have to wait many trillions of years before we expect to make a wrong guess for the bank balance. The moral of the story is that by repeating an unreliable message often enough, you can make it as reliable as you want. (In these examples, we assumed the errors occur at random. If, on the other hand, a malicious entity is deliberately interfering with the transmission and choosing which errors to create, the repetition trick is much more vulnerable. Some of the codes introduced later work well even against this type of malicious attack.)
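To get a feel for these numbers, here is a small simulation sketch. It assumes a channel model that the text only implies: each copy of the digit is independently replaced, with probability 50%, by one of the nine other digits chosen uniformly at random. The function names and the fixed random seed are choices made for this illustration.

```python
import random
from collections import Counter

def noisy_digit(digit, error_rate, rng):
    """Send one digit; with probability error_rate it becomes a different digit."""
    if rng.random() < error_rate:
        return rng.choice([d for d in range(10) if d != digit])
    return digit

def send_with_repetition(digit, copies, error_rate, rng):
    """Transmit the digit `copies` times; return the most common value and the tally."""
    tally = Counter(noisy_digit(digit, error_rate, rng) for _ in range(copies))
    return tally.most_common(1)[0][0], tally

rng = random.Random(0)  # fixed seed so the run is repeatable
guess, tally = send_with_repetition(5, copies=1000, error_rate=0.5, rng=rng)
print("best guess:", guess)
print("tally:", sorted(tally.items()))
```

In a typical run the true digit shows up roughly 500 times and every other digit somewhere around 55 times, so the vote is nowhere near close.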
So, by using the repetition trick, the problem of unreliable communication can be solved, and the chance of a mistake essentially eliminated. Unfortunately, the repetition trick is not good enough for modern computer systems. When transmitting a small piece of data like a bank balance, it is not too costly to retransmit 1000 times, but it would obviously be completely impractical to transmit 1000 copies of a large (say, 200-megabyte) software download. Clearly, computers need to use something more sophisticated than the repetition trick.
THE REDUNDANCY TRICK
Even though computers don't use the repetition trick as it was described above, we covered it first so that we could see the most basic principle of reliable communication in action. This basic principle is that you can't just send the original message; you need to send something extra to increase the reliability. In the case of the repetition trick, the extra thing you send is just more copies of the original message. But it turns out there are many other types of extra stuff you can send to improve the reliability. Computer scientists call the extra stuff “redundancy.” Sometimes, the redundancy is added on to the original message. We'll see this “adding on” technique when we look at the next trick (the checksum trick). But first, we will look at another way of adding redundancy, which actually transforms the original message into a longer “redundant” one—the original message is deleted and replaced by a different, longer one. When you receive the longer message, you can then transform it back into the original, even if it has been corrupted by a poor communication channel. We'll call this simply the redundancy trick.
An example will make this clear. Recall that we were trying to transmit your bank balance of $5213.75 over an unreliable communication channel that randomly altered 20% of the digits. Instead of trying to transmit just “$5213.75,” let's transform this into a longer (and therefore “redundant”) message that contains the same information. In this case, we'll simply spell out the balance in English words, like this:
five two one three point seven five
Let's again suppose that about 20% of the characters in this message get flipped randomly to something else due to a poor communication channel. The message might end up looking something like this:
fiqe kwo one thrxp point sivpn fivq
Although it's a little annoying to read, I think you will agree that anyone who knows English can guess that this corrupted message represents the true bank balance of $5213.75.
The key point is that because we used a redundant message, it is possible to reliably detect and correct any single change to the message. If I tell you that the characters “fiqe” represent a number in English and that only one character has been altered, you can be absolutely certain that the original message was “five,” because there is no other English number that can be obtained from “fiqe” by altering only one character. In stark contrast, if I tell you that the digits “367” represent a number but one of the digits has been altered, you have no way of knowing what the original number was, because there is no redundancy in this message.
Although we haven't yet explored exactly how redundancy works, we have already seen that it has something to do with making the message longer, and that each part of the message should conform to some kind of well-known pattern. In this way, any single change can be first identified (because it does not fit in with a known pattern) and then corrected (by changing the error to fit with the pattern).
Computer scientists call these known patterns “code words.” In our example, the code words are just numbers written in English, like “one,” “two,” “three,” and so on.
Now it's time to explain exactly how the redundancy trick works. Messages are made up of what computer scientists call “symbols.” In our simple example, the symbols are the numeric digits 0-9 (we'll ignore the dollar sign and decimal point to make things even easier). Each symbol is assigned a code word. In our example, the symbol 1 is assigned the code word “one,” 2 is assigned “two,” and so on.
To transmit a message, you first take each symbol and translate it into its corresponding code word. Then you send the transformed message over the unreliable communication channel. When the message is received, you look at each part of a message and check whether it is a valid code word. If it is valid (e.g., “five”), you just transform it back into the corresponding symbol (e.g., 5). If it is not a valid code word (e.g., “fiqe”), you find out which code word it matches most closely (in this case, “five”), and transform that into the corresponding symbol (in this case, 5). Examples of using this code are shown in the figure above.
A code using English words for digits.
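The decoding rule just described (accept an exact code word, otherwise pick the closest one) can be sketched as follows. Several details are assumptions made for this illustration: "closest" is taken to mean "differs in the fewest character positions", the channel is assumed to alter characters but never add or drop them, and the words "zero" and "point" are included as code words so that the whole balance can be decoded.

```python
# Code words for each symbol; "zero" and "point" are assumed additions.
CODE_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "point": ".",
}

def distance(a, b):
    """Number of positions where two equal-length words differ."""
    if len(a) != len(b):
        return float("inf")  # assumption: characters are never added or dropped
    return sum(x != y for x, y in zip(a, b))

def decode_word(word):
    """Map a received word to the symbol of the closest code word."""
    best = min(CODE_WORDS, key=lambda code_word: distance(word, code_word))
    return CODE_WORDS[best]

def decode(message):
    """Decode a space-separated message one word at a time."""
    return "".join(decode_word(word) for word in message.split())

print(decode("fiqe kwo one thrxp point sivpn fivq"))  # 5213.75
```

Several of the corrupted words here contain two wrong characters ("thrxp", "sivpn") and still decode correctly, because English number words happen to be quite different from one another. A code designed by mathematicians makes that kind of separation deliberate rather than lucky.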
That's really all there is to it. Computers actually use this redundancy trick all the time to store and transmit information. Mathematicians have worked out fancier codewords than the English-language ones we were using as an example, but otherwise the workings of reliable computer communication are the same. The figure on the facing page gives a real example. This is the code computer scientists call the (7,4) Hamming code, and it is one of the codes discovered by Richard Hamming at Bell Labs in 1947, in response to the weekend computer crashes described earlier. (Because of Bell's requirement that he patent the codes, Hamming did not publish them until three years later, in 1950.) The most obvious difference to our previous code is that everything is done in terms of zeros and ones. Because every piece of data stored or transmitted by a computer is converted into strings of zeros and ones, any code used in real life is restricted to just these two digits.
A real code used by computers. Computer scientists call this code the (7,4) Hamming code. Note that the “Encoding” box lists only five of the 16 possible 4-digit inputs. The remaining inputs also have corresponding code words, but they are omitted here.
But apart from that, everything works exactly the same as before. When encoding, each group of four digits has redundancy added to it, generating a code word of seven digits. When decoding, you first look for an exact match for the seven digits you received, and if that fails, you take the closest match. You might be worried that, now we are working with only ones and zeros, there might be more than one equally close match and you could end up choosing the wrong decoding. However, this particular code has been designed cunningly so that any single error in a 7-digit codeword can be corrected unambiguously. There is some beautiful mathematics behind the design of codes with this property, but we won't be pursuing the details here.
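As a rough illustration, here is a sketch of a (7,4) Hamming code that decodes exactly as described: look for an exact match, otherwise take the closest code word. The particular choice of parity digits below is one common textbook layout and is an assumption; the code words it produces may be arranged differently from those in the figure, although the error-correcting behavior is the same.

```python
from itertools import product

def encode(data):
    """Append three parity digits to four data digits (an assumed, common layout)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (d1, d2, d3, d4, p1, p2, p3)

# All 16 code words: key is the 7-digit code word, value is the 4-digit input.
CODEBOOK = {encode(bits): bits for bits in product((0, 1), repeat=4)}

def hamming_distance(a, b):
    """Number of positions in which two digit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(received):
    """Use an exact match if there is one, otherwise the closest code word."""
    received = tuple(received)
    if received in CODEBOOK:
        return CODEBOOK[received]
    closest = min(CODEBOOK, key=lambda cw: hamming_distance(cw, received))
    return CODEBOOK[closest]

word = list(encode((0, 1, 0, 0)))
word[2] ^= 1            # one digit is flipped in transit
print(decode(word))     # recovers (0, 1, 0, 0)
```

Any two code words in this set differ in at least three positions, which is why a word with a single flipped digit is always closer to its original code word than to any other.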
It's worth emphasizing why the redundancy trick is preferred to the repetition trick in practice. The main reason is the relative cost of the two tricks. Computer scientists measure the cost of error-correction systems in terms of “overhead.” Overhead is just the amount of extra information that needs to be sent to make sure a message is received correctly. The overhead of the repetition trick is enormous, since you have to send several entire copies of the message. The overhead of the redundancy trick depends on the exact set of code words that you use. In the example above that used English words, the redundant message was 35 characters long, whereas the original message consisted of only 6 numeric digits, so the overhead of this particular application of the redundancy trick is also quite large. But mathematicians have worked out sets of code words that have much lower redundancy, yet still get incredibly high performance in terms of the chance of an error going undetected. The low overhead of these code words is the reason that computers use the redundancy trick instead of the repetition trick.
The discussion so far has used examples of transmitting information using codes, but everything we have discussed applies equally well to the task of storing information. CDs, DVDs, and computer hard drives all rely heavily on error-correcting codes to achieve the superb reliability we observe in practice.
THE CHECKSUM TRICK
So far, we've looked at ways to simultaneously detect and correct errors in data. The repetition trick and the redundancy trick are both ways of doing this. But there's another possible approach to this whole problem: we can forget about correcting errors and concentrate only on detecting them. (The 17th-century philosopher John Locke was clearly aware of the distinction between error detection and error correction—as you can see from the opening quotation of this chapter.) For many applications, merely detecting an error is sufficient, because if you detect an error, you just request another copy of the data. And you can keep on requesting new copies, until you get one that has no errors in it. This is a very frequently used strategy. For example, almost all internet connections use this technique. We'll call it the “checksum trick,” for reasons that will become clear very soon.
To understand the checksum trick, it will be more convenient to pretend that all of our messages consist of numbers only. This is a very realistic assumption, since computers store all information in the form of numbers and only translate the numbers into text or images when presenting that information to humans. But, in any case, it is important to understand that any particular choice of symbols for the messages does not affect the techniques described in this chapter. Sometimes it is more convenient to use numeric symbols (the digits 0-9), and sometimes it is more convenient to use alphabetic symbols (the characters a-z). But in either case, we can agree on some translation between these sets of symbols. For example, one obvious translation from alphabetic to numeric symbols would be a —> 01, b —> 02,…, z —> 26. So it really doesn't matter whether we investigate a technique for transmitting numeric messages or alphabetic messages; the technique can later be applied to any type of message by first doing a simple, fixed translation of the symbols.
At this point, we have to learn what a checksum actually is. There are many different types of checksums, but for now we will stick with the least complicated type, which we'll call a “simple checksum.”
Computing the simple checksum of a numeric message is really, really easy: you just take the digits of the message, add them all up, throw away everything in the result except for the last digit, and the remaining digit is your simple checksum. Here's an example: suppose the message is
4 6 7 5 6
Then the sum of all the digits is 4 + 6 + 7 + 5 + 6 = 28, but we keep only the last digit, so the simple checksum of this message is 8.
But how are checksums used? That's easy: you just append the checksum of your original message to the end of the message before you send it. Then, when the message is received by someone else, they can calculate the checksum again, compare it with the one you sent, and see if it is correct. In other words, they “check” the “sum” of the message—hence the terminology “checksum.” Let's stick with the above example. The simple checksum of the message “46756” is 8, so we transmit the message and its checksum as
4 6 7 5 6 8
Now, the person receiving the message has to know that you are using the checksum trick. Assuming they do know, they can immediately recognize that the last digit, the 8, is not part of the original message, so they put it to one side and compute the checksum of everything else. If there were no errors in the transmission of the message, they will compute 4 + 6 + 7 + 5 + 6 = 28, keep the last digit (which is 8), check that it is equal to the checksum they put aside earlier (which it is), and therefore conclude that the message was transmitted correctly. On the other hand, what happens if there was an error in transmitting the message? Suppose the 7 was randomly changed to a 3. Then you would receive the message
4 6 3 5 6 8
You would set aside the 8 for later comparison and compute the checksum as 4 + 6 + 3 + 5 + 6 = 24, keeping only the last digit (4). This is not equal to the 8 that was set aside earlier, so you would be sure that the message was corrupted during transmission. At this point, you request that the message is retransmitted, wait until you receive a new copy, then again compute and compare the checksum. And you can keep doing this until you get a message whose checksum is correct.
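The whole procedure is short enough to write out. In this sketch, messages are strings of digits and the function names (`simple_checksum`, `attach_checksum`, `verify`) are invented for the illustration; requesting the retransmission itself is left to the caller.

```python
def simple_checksum(digits):
    """Add up the digits of the message and keep only the last digit of the sum."""
    return sum(int(d) for d in digits) % 10

def attach_checksum(message):
    """Sender side: append the checksum digit to the end of the message."""
    return message + str(simple_checksum(message))

def verify(received):
    """Receiver side: set the final digit aside and compare it with a fresh checksum."""
    message, sent_checksum = received[:-1], int(received[-1])
    return simple_checksum(message) == sent_checksum

print(attach_checksum("46756"))  # 467568
print(verify("467568"))          # True: the checksums agree
print(verify("463568"))          # False: ask for the message again
```

A receiver would simply keep calling `verify` on each new copy until it returns `True`.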
All of this seems almost too good to be true. Recall that the “overhead” of an error-correcting system is the amount of extra information you have to send in addition to the message itself. Well, here we seem to have the ultimate low-overhead system, since no matter how long the message is, we only have to add one extra digit (the checksum) to detect an error!
Alas, it turns out that this system of simple checksums is too good to be true. Here is the problem: the simple checksum described above is guaranteed to detect an error only when a single digit of the message is wrong. If there are two or more errors, the simple checksum might detect them, but then again it might not. Let's look at some examples of this:
The original message (46756) is the same as before, and so is its checksum (8). In the next line is a message with one error (the first digit is a 1 instead of a 4), and the checksum turns out to be 5. In fact, you can probably convince yourself that changing any single digit results in a checksum that differs from 8, so you are guaranteed to detect any single mistake in the message. It's not hard to prove that this is always true: if there is only one error, a simple checksum is absolutely guaranteed to detect it.
In the next line of the table, we see a message with two errors: each of the first two digits has been altered. In this case, the checksum happens to be 4. And since 4 is different from the original checksum, which was 8, the person receiving this message would in fact detect that an error had been made. However, the crunch comes in the last line of the table. Here is another message with two errors, again in the first two digits. But the values are different, and it so happens that the checksum for this two-error message is 8—the same as the original! So a person receiving this message would fail to detect that there are errors in the message.
Luckily, it turns out that we can get around this problem by adding a few more tweaks to our checksum trick. The first step is to define a new type of checksum. Let's call it a “staircase” checksum because it helps to think of climbing a staircase while computing it. Imagine you are at the bottom of a staircase with the stairs numbered 1, 2, 3, and so on. To calculate a staircase checksum, you add up the digits just like before, but each digit gets multiplied by the number of the stair you are on, and you have to move up one step for each digit. At the end, you throw away everything except the last digit, just as with the simple checksum. So if the message is
4 6 7 5 6
like before, then the staircase checksum is calculated by first calculating the staircase sum
(1 × 4) + (2 × 6) + (3 × 7) + (4 × 5) + (5 × 6) = 4 + 12 + 21 + 20 + 30 = 87.
Then throw away everything except the last digit, which is 7. So the staircase checksum of “46756” is 7.
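In code, climbing the staircase amounts to numbering the digits from 1 upward and multiplying each digit by its number. This is a minimal sketch; the function name is made up for the illustration.

```python
def staircase_checksum(digits):
    """Multiply each digit by its stair number (1, 2, 3, ...), add, keep the last digit."""
    total = sum(step * int(d) for step, d in enumerate(digits, start=1))
    return total % 10

print(staircase_checksum("46756"))  # staircase sum is 87, so this prints 7
```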
What is the point of all this? Well, it turns out that if you include both the simple and staircase checksums, then you are guaranteed to detect any two errors in any message. So our new checksum trick is to transmit the original message, then two extra digits: the simple checksum first, then the staircase checksum. For example, the message “46756” would now be transmitted as
4 6 7 5 6 8 7
When you receive the message, you again have to know by prior agreement exactly what trick has been applied. But assuming you do know, it is easy to check for errors just as with the simple checksum trick. In this case you first set aside the last two digits (the 8, which is the simple checksum, and the 7, which is the staircase checksum). You then compute the simple checksum of the rest of the message (46756, which comes to 8), and you compute the staircase checksum too (which comes to 7). If both the computed checksum values match the ones that were sent (and in this case they do), you are guaranteed that the message is either correct, or has at least three errors.
The next table shows this in practice. It is identical to the previous table except that the staircase checksum has been added to each row, and a new row has been added as an extra example. When there is one error, we find that both the simple and staircase checksums differ from the original message (5 instead of 8, and 4 instead of 7). When there are two errors, it is possible for both checksums to differ, as in the third row of the table where we see 4 instead of 8, and 2 instead of 7. But as we already found out, sometimes the simple checksum will not change when there are two errors. The fourth row shows an example, where the simple checksum is still 8. But because the staircase checksum differs from the original (9 instead of 7), we still know that this message has errors. And in the last row, we see that it can work out the other way around too: here is an example of two errors that results in a different simple checksum (9 instead of 8), but the same staircase checksum (7). But, again, the point is that we can still detect the error because at least one of the two checksums differs from the original. And although it would take some slightly technical math to prove it, this is no accident: it turns out that you will always be able to detect the errors if there are no more than two of them.
Now that we have a grasp of the fundamental approach, we need to be aware that the checksum trick just described is guaranteed to work only for relatively short messages (fewer than 10 digits). But very similar ideas can be applied to longer messages. It is possible to define checksums by certain sequences of simple operations like adding up the digits, multiplying the digits by “staircases” of various shapes, and swapping some of the digits around according to a fixed pattern. Although that might sound complicated, computers can compute these checksums blindingly fast and it turns out to be an extremely useful, practical way of detecting errors in a message.
The checksum trick described above produces only two checksum digits (the simple digit and the staircase digit), but real checksums usually produce many more digits than that—sometimes as many as 150 digits. (Throughout the remainder of this chapter, I am talking about the ten decimal digits, 0-9, not the two binary digits, 0 and 1, which are more commonly used in computer communication.) The important point is that the number of digits in the checksum (whether 2, as in the example above, or about 150, as for some checksums used in practice) is fixed. But although the length of the checksums produced by any given checksum algorithm is fixed, you can compute checksums of messages that are as long as you want. So for very long messages, even a relatively large checksum like 150 digits ends up being minuscule in proportion to the message itself. For example, suppose you use a 100-digit checksum to verify the correctness of a 20-megabyte software package downloaded from the web. The checksum is less than one-thousandth of 1% of the size of the software package. I'm sure you would agree this is an acceptable level of overhead! And a mathematician will tell you that the chance of failing to detect an error when using a checksum of this length is so incredibly tiny that it is for all practical purposes impossible.
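The overhead figure can be checked with a little arithmetic. The sketch below assumes that a megabyte is 10**6 bytes and that the checksum's decimal digits are packed as tightly as possible into bytes; both are assumptions made for this illustration.

```python
import math

CHECKSUM_DIGITS = 100
PACKAGE_BYTES = 20 * 10**6   # assumption: 1 megabyte = 10**6 bytes

# Each decimal digit carries log2(10) bits (about 3.32), so convert digits to bytes.
checksum_bytes = CHECKSUM_DIGITS * math.log2(10) / 8
overhead_percent = 100 * checksum_bytes / PACKAGE_BYTES

print(f"checksum is about {checksum_bytes:.1f} bytes")
print(f"overhead is about {overhead_percent:.5f}% of the package")
```

Under these assumptions the checksum occupies roughly 42 bytes, about 0.0002% of the package, which is indeed below one-thousandth of 1%.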
As usual, there are a few important technical details here. It's not true that any 100-digit checksum system has this incredibly high resistance to failure. It requires a certain type of checksum that computer scientists call a cryptographic hash function—especially if the changes to the message might be made by a malicious opponent, instead of the random vagaries of a poor communication channel. This is a very real issue, because it is possible that an evil hacker might try to alter that 20-megabyte software package in such a way that it has the same 100-digit checksum, but is actually a different piece of software that will take control of your computer! The use of cryptographic hash functions eliminates this possibility.
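As a rough illustration of how such a check looks in practice, the sketch below uses SHA-256 from Python's standard hashlib module. The package contents and the idea of a "published" digest are stand-ins invented for this example, not a description of any particular download site.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """Compare the recomputed digest with the published one."""
    return hmac.compare_digest(sha256_hex(data), published_hex.lower())


# Hypothetical stand-in for the 20-megabyte software package.
package = b"pretend this is a 20-megabyte software package"
published = sha256_hex(package)   # what a publisher would post next to the download

tampered = package.replace(b"software", b"malware!")
print(matches_published_digest(package, published))    # True
print(matches_published_digest(tampered, published))   # False
```

The check is only as trustworthy as the channel that delivered the published digest: an attacker who can replace both the package and the digest defeats it.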
THE PINPOINT TRICK
Now that we know about checksums, we can go back to the original problem of both detecting and correcting communication errors. We already know how to do this, either inefficiently using the repetition trick or efficiently using the redundancy trick. But let's return to this now, because we never really found out how to create the code words that form the key ingredient in this trick. We did have the example of using English words to describe numerals, but this particular set of code words is less efficient than the ones computers actually use. And we also saw the real example of a Hamming code, but without any explanation of how the code words were produced in the first place.
So now we will learn about another possible set of code words that can be used to perform the redundancy trick. Because this is a very special case of the redundancy trick that allows you to quickly pinpoint an error, we'll call this the “pinpoint trick.”
Just as we did with the checksum trick, we will work entirely with numerical messages consisting of the digits 0-9, but you should keep in mind that this is just for convenience. It is very simple to take an alphabetical message and translate it into numbers, so the technique described here can be applied to any message whatsoever.
To keep things simple, we'll assume that the message is exactly 16 digits long, but, again, this doesn't limit the technique in practice. If you have a long message, just break it into 16-digit chunks and work with each chunk separately. If the message is shorter than 16 digits, fill it up with zeroes, until it is 16 digits long.
The first step in the pinpoint trick is to rearrange the 16 digits of the message into a square that reads left to right, top to bottom. So if the actual message is
4837543622563997
it gets rearranged into
4 8 3 7
5 4 3 6
2 2 5 6
3 9 9 7
Next, we compute a simple checksum of each row and add it to the right-hand side of the row:
4 8 3 7   2
5 4 3 6   8
2 2 5 6   5
3 9 9 7   8
These simple checksums are computed just like before. For example, to get the second row checksum you compute 5 + 4 + 3 + 6 = 18 and then take the last digit, which is 8.
The next step in the pinpoint trick is to compute simple checksums for each column and add these in a new row at the bottom:
4 8 3 7   2
5 4 3 6   8
2 2 5 6   5
3 9 9 7   8

4 3 0 6
Again, there's nothing mysterious about the simple checksums. For example, the third column is computed from 3 + 3 + 5 + 9 = 20, which becomes 0 when we take the last digit.
The next step in the pinpoint trick is to reorder everything so it can be stored or transmitted one digit at a time. You do this in the obvious way, reading digits from left to right, top to bottom. So we end up with the following 24-digit message:
483725436822565399784306
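The encoding steps just described can be sketched as follows. The function names are invented for this illustration, and the sketch handles only a single 16-digit chunk; splitting or padding a longer or shorter message is left out.

```python
def simple_checksum(digits):
    """Last digit of the sum of a sequence of digits."""
    return sum(digits) % 10


def pinpoint_encode(message: str) -> str:
    """Encode a 16-digit message as 24 digits using row and column checksums."""
    if len(message) != 16 or not message.isdigit():
        raise ValueError("expected exactly 16 decimal digits")
    digits = [int(ch) for ch in message]
    rows = [digits[i:i + 4] for i in range(0, 16, 4)]        # the 4-by-4 square
    out = []
    for row in rows:
        out.extend(row + [simple_checksum(row)])             # row, then its checksum
    out.extend(simple_checksum(col) for col in zip(*rows))   # bottom row of column checksums
    return "".join(str(d) for d in out)


print(pinpoint_encode("4837543622563997"))   # 483725436822565399784306
```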
Now imagine you have received a message that has been transmitted using the pinpoint trick. What steps do you follow to work out the original message and correct any communication errors? Let's work through an example. The original 16-digit message will be the same as the one above, but to make things interesting, suppose there was a communication error and one of the digits was altered. Don't worry about which is the altered digit yet—we will be using the pinpoint trick to determine that very shortly.
So let's suppose the 24-digit message you received is
483725436827565399784306
Your first step will be to lay the digits out in a 5-by-5 square, recognizing that the last column and last row correspond to checksum digits that were sent with the original message:
4 8 3 7   2
5 4 3 6   8
2 7 5 6   5
3 9 9 7   8

4 3 0 6
Next, compute simple checksums of the first four digits in each row and column, recording the results in a newly created row and column next to the checksum values that you received:
                  sent  calc
 4   8   3   7      2     2
 5   4   3   6      8     8
 2   7   5   6      5     0
 3   9   9   7      8     8

 4   3   0   6    (sent)
 4   8   0   6    (calculated)
It is crucial to bear in mind that there are two sets of checksum values here: the ones you were sent, and the ones you calculated. Mostly, the two sets of values will be the same. In fact, if they are all identical, you can conclude that the message is very likely correct. But if there was a communication error, some of the calculated checksum values will differ from the sent ones. Notice that in the current example, there are two such differences: the 5 and 0 in the third row differ, and so do the 3 and 8 in the second column. The offending checksums are marked with square brackets:
                  sent  calc
 4   8   3   7      2     2
 5   4   3   6      8     8
 2   7   5   6     [5]   [0]
 3   9   9   7      8     8

 4  [3]  0   6    (sent)
 4  [8]  0   6    (calculated)
Here is the key insight: the location of these differences tells you exactly where the communication error occurred! It must be in the third row (because every other row had the correct checksum), and it must be in the second column (because every other column had the correct checksum). And as you can see from the following diagram, this narrows it down to exactly one possibility, the 7 marked with asterisks:
                  sent  calc
 4   8   3   7      2     2
 5   4   3   6      8     8
 2  *7*  5   6     [5]   [0]
 3   9   9   7      8     8

 4  [3]  0   6    (sent)
 4  [8]  0   6    (calculated)
But that's not all: we have located the error, but not yet corrected it. Fortunately, this is easy. We just have to replace the erroneous 7 with a number that will make both of the checksums correct. We can see that the second column was meant to have a checksum of 3, but it came out to 8 instead. In other words, the checksum needs to be reduced by 5. So let's reduce the erroneous 7 by 5, which leaves 2:
4 8 3 7   2
5 4 3 6   8
2 2 5 6   5
3 9 9 7   8

4 3 0 6
You can even double-check this change by examining the third row: it now has a checksum of 5, which agrees with the received checksum. The error has been both located and corrected! The final step is to extract the corrected 16-digit message from the 5-by-5 square by reading left to right, top to bottom (and ignoring the final row and column, of course). This gives
4837543622563997
which really is the same message that we started with.
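The receiving side of the pinpoint trick can be sketched in the same spirit. The function name is invented for this illustration. The handling of a damaged checksum digit (exactly one mismatch) and of multiple errors goes beyond what the text describes and is an assumption of this sketch.

```python
def pinpoint_decode(received: str) -> str:
    """Recover the 16-digit message, repairing at most one altered digit."""
    if len(received) != 24 or not received.isdigit():
        raise ValueError("expected exactly 24 decimal digits")
    digits = [int(ch) for ch in received]
    rows = [digits[i:i + 4] for i in range(0, 20, 5)]     # data digits of each row
    sent_row = [digits[i + 4] for i in range(0, 20, 5)]   # checksums in the last column
    sent_col = digits[20:24]                              # checksums in the last row

    calc_row = [sum(row) % 10 for row in rows]
    calc_col = [sum(col) % 10 for col in zip(*rows)]
    bad_rows = [r for r in range(4) if calc_row[r] != sent_row[r]]
    bad_cols = [c for c in range(4) if calc_col[c] != sent_col[c]]

    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        excess = (calc_col[c] - sent_col[c]) % 10         # how far the column checksum is off
        rows[r][c] = (rows[r][c] - excess) % 10
        if sum(rows[r]) % 10 != sent_row[r]:              # double-check using the row
            raise ValueError("row and column disagree: more than one error")
    elif len(bad_rows) + len(bad_cols) > 1:
        raise ValueError("more than one error: cannot pinpoint it")
    # No mismatch: nothing to fix. Exactly one mismatch: assume a checksum digit was hit.

    return "".join(str(d) for row in rows for d in row)


print(pinpoint_decode("483725436827565399784306"))   # 4837543622563997
```

Like the trick itself, this sketch is only reliable when at most one digit has been altered. Two or more errors can make the checksums point at the wrong digit.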
In computer science, the pinpoint trick goes by the name of “two-dimensional parity.” The word parity means the same thing as a simple checksum, when working with the binary numbers computers normally use. And the parity is described as two-dimensional because the message gets laid out in a grid with two dimensions (rows and columns). Two-dimensional parity has been used in some real computer systems, but it is not as effective as certain other redundancy tricks. I chose to explain it here because it is very easy to visualize and conveys the flavor of how one can both find and correct errors without requiring the sophisticated math behind the codes popular in today's computer systems.
ERROR CORRECTION AND DETECTION IN THE REAL WORLD
Error-correcting codes sprang into existence in the 1940s, rather soon after the birth of the electronic computer itself. In retrospect, it's not hard to see why: early computers were rather unreliable, and their components frequently produced errors. But the true roots of error-correcting codes lie even earlier, in communication systems such as telegraphs and telephones. So it is not altogether surprising that the two major events triggering the creation of error-correcting codes both occurred in the research laboratories of the Bell Telephone Company. The two heroes of our story, Claude Shannon and Richard Hamming, were both researchers at Bell Labs. Hamming we have met already: it was his annoyance at the weekend crashes of a company computer that led directly to his invention of the first error-correcting codes, now known as Hamming codes.
However, error-correcting codes are just one part of a larger discipline called information theory, and most computer scientists trace the birth of the field of information theory to a 1948 paper by Claude Shannon. This extraordinary paper, entitled “A Mathematical Theory of Communication,” is described in one biography of Shannon as “the Magna Carta of the information age.” Irving Reed (co-inventor of the Reed-Solomon codes mentioned below) said of the same paper: “Few other works of this century have had greater impact on science and engineering. By this landmark paper…he has altered most profoundly all aspects of communication theory and practice.” Why the high praise? Shannon demonstrated through mathematics that it was possible, in principle, to achieve surprisingly high rates of error-free communication over a noisy, error-prone link. It was not until many decades later that scientists came close to achieving Shannon's theoretical maximum communication rate in practice.
Incidentally, Shannon was apparently a man of extremely diverse interests. As one of the four main organizers of the 1956 Dartmouth AI conference (discussed at the end of chapter 6), he was intimately involved in the founding of another field: artificial intelligence. But it doesn't stop there. He also rode unicycles and built an improbable-sounding unicycle with an elliptical (i.e., noncircular) wheel, meaning that the rider moved up and down as the unicycle moved forward!
Shannon's work placed Hamming codes in a larger theoretical context and set the stage for many further advances. Hamming codes were thus used in some of the earliest computers and are still widely used in certain types of memory systems. Another important family of codes is known as the Reed-Solomon codes. These codes can be adapted to correct for a large number of errors per code word. (Contrast this with the (7,4) Hamming code in the figure on page 67, which can correct only one error in each 7-digit code word.) Reed-Solomon codes are based on a branch of mathematics called finite field algebra, but you can think of them, very roughly, as combining the features of the staircase checksum and the two-dimensional pinpoint trick. They are used in CDs, DVDs, and computer hard drives.
Checksums are also widely used in practice, typically for detecting rather than correcting errors. Perhaps the most pervasive example is Ethernet, the networking protocol used by almost every computer on the planet these days. Ethernet employs a checksum called CRC-32 to detect errors. The most common internet protocol, called TCP (for Transmission Control Protocol), also uses checksums for each chunk, or packet, of data that it sends. Packets whose checksums are incorrect are simply discarded, because TCP is designed to automatically retransmit them later if necessary. Software packages published on the internet are often verified using checksums; popular ones include a checksum called MD5, and another called SHA-1. Both are intended to be cryptographic hash functions, providing protection against malicious alteration of the software as well as random communication errors. MD5 checksums have about 40 digits, SHA-1 produces about 50 digits, and there are some even more error-resistant checksums in the same family, such as SHA-256 (about 75 digits) and SHA-512 (about 150 digits).
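The approximate digit counts quoted above can be checked against Python's standard hashlib and zlib modules. The helper name below is made up for this sketch. It reports the largest number of decimal digits a digest of each size could need, so the results come out close to the rounded figures in the text rather than identical to them.

```python
import hashlib
import zlib


def max_decimal_digits(name: str) -> int:
    """Most decimal digits needed to write any digest of the named hash function."""
    bits = hashlib.new(name).digest_size * 8
    return len(str(2**bits - 1))


for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, max_decimal_digits(name))   # 39, 49, 78 and 155 respectively

# CRC-32 is a 32-bit checksum, so its value always fits in 10 decimal digits.
crc = zlib.crc32(b"4837543622563997")
print(crc)
```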
The science of error-correcting and error-detecting codes continues to expand. Since the 1990s, an approach known as low-density parity-check codes has received considerable attention. These codes are now used in applications ranging from satellite TV to communication with deep space probes. So the next time you enjoy some high-definition satellite TV on the weekend, spare a thought for this delicious irony: it was the frustration of Richard Hamming's weekend battle with an early computer that led to our own weekend entertainment today.
6
Pattern Recognition: Learning from Experience
—ADA LOVELACE, from her 1843 notes on the Analytical Engine
In each previous chapter, we've looked at an area in which the ability of computers far outstrips the ability of humans. For example, a computer can typically encrypt or decrypt a large file within a second or two, whereas it would take a human many years to perform the same computations by hand. For an even more extreme example, imagine how long it would take a human to manually compute the PageRank of billions of web pages according to the algorithm described in chapter 3. This task is so vast that, in practice, it is impossible for a human. Yet the computers at web search companies are constantly performing these computations.
In this chapter, on the other hand, we examine an area in which humans have a natural advantage: the field of pattern recognition. Pattern recognition is a subset of artificial intelligence and includes tasks such as face recognition, object recognition, speech recognition, and handwriting recognition. More specific examples would include the task of determining whether a given photograph is a picture of your sister, or determining the city and state written on a hand-addressed envelope. Thus, pattern recognition can be defined more generally as the task of getting computers to act “intelligently” based on input data that contains a lot of variability.
The word “intelligently” is in quotation marks here for good reason: the question of whether computers can ever exhibit true intelligence is highly controversial. The opening quotation of this chapter represents one of the earliest salvos in this debate: Ada Lovelace commenting, in 1843, on the design of an early mechanical computer called the Analytical Engine. Lovelace is sometimes described as the world's first computer programmer because of her profound insights about the Analytical Engine. But in this pronouncement, she emphasizes that computers lack originality: they must slavishly follow the instructions of their human programmers. These days, computer scientists disagree on whether computers can, in principle, exhibit intelligence. And the debate becomes even more complex if philosophers, neuroscientists, and theologians are thrown into the mix.
Fortunately, we don't have to resolve the paradoxes of machine intelligence here. For our purposes, we might as well replace the word “intelligent” with “useful.” So the basic task of pattern recognition is to take some data with extremely high variability—such as photographs of different faces in different lighting conditions, or samples of many different words handwritten by many different people—and do something useful with it. Humans can unquestionably process such data intelligently: we can recognize faces with uncanny accuracy, and read the handwriting of virtually anyone without having to see samples of their writing in advance. It turns out that computers are vastly inferior to humans at such tasks. But some ingenious algorithms have emerged that enable computers to achieve good performance on certain pattern recognition tasks. In this chapter, we will learn about three of these algorithms: nearest-neighbor classifiers, decision trees, and artificial neural networks. But first, we need a more scientific description of the problem we are trying to solve.
WHAT'S THE PROBLEM?
The tasks of pattern recognition might seem, at first, to be almost absurdly diverse. Can computers use a single toolbox of pattern recognition techniques to recognize handwriting, faces, speech, and more? One possible answer to this question is staring us (literally) in the face: our own human brains achieve superb speed and accuracy in a wide array of recognition tasks. Could we write a computer program to achieve the same thing?
Before we can discuss the techniques that such a program might use, we need to somehow unify the bewildering array of tasks and define a single problem that we are trying to solve. The standard approach here is to view pattern recognition as a classification problem. We assume that the data to be processed is divided up into sensible chunks called samples, and that each sample belongs to one of a fixed set of possible classes. For example, in a face recognition problem, each sample would be a picture of a face, and the classes would be the identities of the people the system can recognize. In some problems, there are only two classes. A common example of this is in medical diagnosis for a particular disease, where the two classes might be “healthy” and “sick,” while each data sample could consist of all the test results for a single patient (e.g., blood pressure, weight, x-ray images, and possibly many other things). So the computer's task is to process new data samples that it has never seen before and classify each sample into one of the possible classes.
To make things concrete, let's focus on a single pattern recognition task for now. This is the task of recognizing handwritten digits. Some typical data samples are shown in the figure on the facing page. There are exactly ten classes in this problem: the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. So the task is to classify samples of handwritten digits as belonging to one of these ten classes. This is, of course, a problem of great practical significance, since mail in the United States and many other countries is addressed using numeric postcodes. If a computer can rapidly and accurately recognize these postcodes, mail can be sorted by machines much more efficiently than by humans.
Obviously, computers have no built-in knowledge of what handwritten digits look like. And, in fact, humans don't have this built-in knowledge either: we learn how to recognize digits and other handwriting through some combination of explicit teaching by other humans and seeing examples that we use to teach ourselves. These two strategies (explicit teaching and learning from examples) are also used in computer pattern recognition. However, it turns out that for all but the simplest of tasks, explicit teaching of computers is ineffective. For example, we can think of the climate controls in my house as a simple classification system. A data sample consists of the current temperature and time of day, and the three possible classes are “heat on,” “air-conditioning on,” and “both off.” Because I work in an office during the day, I program the system to be “both off” during daytime hours, and outside those hours it is “heat on” if the temperature is too low and “air-conditioning on” if the temperature is too high. Thus, in the process of programming my thermostat, I have in some sense “taught” the system to perform classification into these three classes.
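The thermostat is a case where explicit teaching works, because the whole classifier fits in a few hand-written rules. A minimal sketch follows. The office hours and temperature limits are hypothetical values chosen for the illustration, not figures from the text.

```python
def climate_class(temperature_c: float, hour: int) -> str:
    """Hand-written rule: classify (temperature, time of day) into one of three classes."""
    DAY_START, DAY_END = 8, 18        # hypothetical office hours (24-hour clock)
    TOO_COLD, TOO_HOT = 18.0, 26.0    # hypothetical comfort limits, degrees Celsius

    if DAY_START <= hour < DAY_END:
        return "both off"             # nobody is home during the day
    if temperature_c < TOO_COLD:
        return "heat on"
    if temperature_c > TOO_HOT:
        return "air-conditioning on"
    return "both off"


print(climate_class(10.0, 22))   # heat on
print(climate_class(30.0, 6))    # air-conditioning on
print(climate_class(10.0, 12))   # both off
```

No comparably short set of rules is known for a task like recognizing handwritten digits, which is why learning from examples is needed there.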
Unfortunately, no one has ever been able to explicitly “teach” a computer to solve more interesting classification tasks, such as the handwritten digits on the next page. So computer scientists turn to the other strategy available: getting a computer to automatically “learn” how to classify samples. The basic strategy is to give the computer a large amount of labeled data: samples that have already been classified. The figure on page 84 shows an example of some labeled data for the handwritten digit task. Because each sample comes with a label (i.e., its class), the computer can use various analytical tricks to extract characteristics of each class. When it is later presented with an unlabeled sample, the computer can guess its class by choosing the one whose characteristics are most similar to the unlabeled sample.
Most pattern recognition tasks can be phrased as classification problems. Here, the task is to classify each handwritten digit as one of the 10 digits 0,1,…, 9. Data source: MNIST data of LeCun et al. 1998.
The process of learning the characteristics of each class is often called “training,” and the labeled data itself is the “training data.” So in a nutshell, pattern recognition tasks are divided into two phases: first, a training phase in which the computer learns about the classes based on some labeled training data; and second, a classification phase in which the computer classifies new, unlabeled data samples.
THE NEAREST-NEIGHBOR TRICK
Here's an interesting classification task: can you predict, based only on a person's home address, which political party that person will make a donation to? Obviously, this is an example of a classification task that cannot be performed with perfect accuracy, even by a human: a person's address doesn't tell us enough to predict political affiliations. But, nevertheless, we would like to train a classification system that predicts which party a person is most likely to donate to, based only on a home address.
To train a classifier, a computer needs some labeled data. Here, each sample of data (a handwritten digit) comes with a label specifying one of the 10 possible digits. The labels are on the left, and the training samples are in boxes on the right. Data source: MNIST data of LeCun et al. 1998.
The figure on the next page shows some training data that could be used for this task. It shows a map of the actual donations made by the residents of a particular neighborhood in Kansas, in the 2008 U.S. presidential election. (In case you are interested, this is the College Hill neighborhood of Wichita, Kansas.) For clarity, streets are not shown on the map, but the actual geographic location of each house that made a donation is shown accurately. Houses that donated to the Democrats are marked with a “D,” and an “R” marks donations to the Republicans.
So much for the training data. What are we going to do when given a new sample that needs to be classified as either Democrat or Republican? The figure on page 86 shows this concretely. The training data is shown as before, but in addition there are two new locations shown as question marks. Let's focus first on the upper question mark. Just by glancing at it, and without trying to do anything scientific, what would you guess is the most likely class for this question mark? It seems to be surrounded by Democratic donations, so a “D” seems quite probable. How about the other question mark, on the lower left? This one isn't exactly surrounded by Republican donations, but it does seem to be more in Republican territory than Democrat, so “R” would be a good guess.
Training data for predicting political party donations. A “D” marks a house that donated to the Democrats, and “R” marks Republican donations. Data source: Fundrace project, Huffington Post.
Believe it or not, we have just mastered one of the most powerful and useful pattern recognition techniques ever invented: an approach that computer scientists call the nearest-neighbor classifier. In its simplest form, this “nearest-neighbor” trick does just what it sounds like. When you are given an unclassified data sample, first find the nearest neighbor to that sample in the training data and then use the class of this nearest neighbor as your prediction. In the figure on the next page, this just amounts to guessing the closest letter to each of the question marks.
A slightly more sophisticated version of this trick is known as “K-nearest-neighbors,” where K is a small number like 3 or 5. In this formulation, you examine the K nearest neighbors of the question mark and choose the class that is most popular among these neighbors. We can see this in action in the figure on page 87. Here, the nearest single neighbor to the question mark is a Republican donation, so the simplest form of the nearest-neighbor trick would classify this question mark as an “R.” But if we move to using 3 nearest neighbors, we find that this includes two Democrat donations and one Republican donation—so in this particular set of neighbors, Democrat donations are more popular and the question mark is classified as a “D.”
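The voting procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates below are made up (they are not the Wichita data), and the names `TRAINING` and `k_nearest_label` are hypothetical. Setting `k=1` gives the simplest nearest-neighbor trick; a larger `k` gives K-nearest-neighbors.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (x, y) map positions with a party label.
TRAINING = [
    ((1.0, 1.0), "R"),
    ((2.0, 0.0), "D"),
    ((0.0, 2.0), "D"),
    ((5.0, 5.0), "R"),
    ((6.0, 5.0), "R"),
]

def k_nearest_label(query, training, k=1):
    """Return the most common label among the k training points closest to query."""
    by_distance = sorted(training, key=lambda item: dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

question_mark = (1.2, 1.2)
print(k_nearest_label(question_mark, TRAINING, k=1))  # R: the single closest house
print(k_nearest_label(question_mark, TRAINING, k=3))  # D: two of the three closest houses
```

With an odd `k` and only two classes, the vote can never be tied, which is one practical reason small odd values such as 3 or 5 are popular.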
Classification using the nearest-neighbor trick. Each question mark is assigned the class of its nearest neighbor. The upper question mark becomes a “D,” while the lower one becomes an “R.” Data source: Fundrace project, Huffington Post.
So, how many neighbors should we use? The answer depends on the problem being tackled. Typically, practitioners try a few different values and see what works best. This might sound unscientific, but it reflects the reality of effective pattern recognition systems, which are generally crafted using a combination of mathematical insight, good judgment, and practical experience.
Different Kinds of “Nearest” Neighbors
So far, we've worked on a problem that was deliberately chosen to have a simple, intuitive interpretation of what it means for one data sample to be the “nearest” neighbor of another data sample. Because each data point was located on a map, we could just use the geographic distance between points to work out which ones were closest. But what are we going to do when each data sample is a handwritten digit like the ones on page 83? We need some way of computing the “distance” between two different examples of handwritten digits. The figure on the following page shows one way of doing this.
An example of using K-nearest-neighbors. When using only the single nearest neighbor, the question mark is classified as an “R,” but with three nearest neighbors, it becomes a “D.” Data source: Fundrace project, Huffington Post.
The basic idea is to measure the difference between images of digits, rather than a geographical distance between them. The difference will be measured as a percentage—so images that are only 1% different are very close neighbors, and images that are 99% different are very far from each other. The figure shows specific examples. (As is usual in pattern recognition tasks, the inputs have undergone certain preprocessing steps. In this case, each digit is rescaled to be the same size as the others and centered within its image.) In the top row of the figure, we see two different images of handwritten 2's. By doing a sort of “subtraction” of these images, we can produce the image on the right, which is white everywhere except at the few places where the two images were different. It turns out that only 6% of this difference image is black, so these two examples of handwritten 2's are relatively close neighbors. On the other hand, in the bottom row of the figure, we see the results when images of different digits (a 2 and a 9) are subtracted. The difference image on the right has many more black pixels, because the images disagree in more places. In fact, about 21% of this image is black, so the two images are not particularly close neighbors.
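One way to make this "subtraction" concrete is sketched below. It assumes each image is a grid of pure black-and-white pixels (1 for ink, 0 for background) and that both images have already been rescaled and centered. Real scans contain shades of gray, so this is a simplification, and `percent_difference` is a hypothetical name.

```python
def percent_difference(image_a, image_b):
    """Percentage of pixel positions at which two equal-sized binary images disagree."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in image_a)
    differing = sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(image_a, image_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )
    return 100.0 * differing / total

# Two made-up 4x4 images (1 = ink, 0 = background) that differ in one pixel.
first = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
second = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
print(percent_difference(first, second))  # 1 of 16 pixels differs: 6.25
```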
Computing the “distance” between two handwritten digits. In each row, the second image is subtracted from the first one, resulting in a new image on the right that highlights the differences between the two images. The percentage of this difference image that is highlighted can be regarded as a “distance” between the original images. Data source: MNIST data of LeCun et al., 1998.
Now that we know how to find out the “distance” between handwritten digits, it's easy to build a pattern recognition system for them. We start off with a large amount of training data—just as in the figure on page 84, but with a much bigger number of examples. Typical systems of this sort might use 100,000 labeled examples. Now, when the system is presented with a new, unlabeled handwritten digit, it can search through all 100,000 examples to find the single example that is the closest neighbor to the one being classified. Remember, when we say “closest neighbor” here, we really mean the smallest percentage difference, as computed by the method in the figure above. The unlabeled digit is assigned the same label as this nearest neighbor.
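The search just described might look like the following sketch. The three tiny 3 × 3 "digits" are invented for illustration, where a real system would hold many thousands of labeled scans. The names `pixel_difference` and `classify` are hypothetical.

```python
def pixel_difference(image_a, image_b):
    """Fraction of positions where two equal-length flattened binary images differ."""
    return sum(a != b for a, b in zip(image_a, image_b)) / len(image_a)

def classify(unlabeled, labeled_examples):
    """Return the label of the training example with the smallest difference."""
    _, best_label = min(
        labeled_examples,
        key=lambda example: pixel_difference(unlabeled, example[0]),
    )
    return best_label

# Made-up 3x3 "digits", flattened row by row (1 = ink, 0 = background).
EXAMPLES = [
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "1"),
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "0"),
    ((1, 1, 1, 0, 0, 1, 0, 0, 1), "7"),
]

smudged_one = (0, 1, 0, 0, 1, 0, 0, 1, 1)  # a "1" with one stray pixel
print(classify(smudged_one, EXAMPLES))  # 1
```

Notice that `classify` looks at every training example for every new digit. There is no learning step at all, but each classification costs one comparison per stored example.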
It turns out that a system using this type of “closest neighbor” distance works rather well, with about 97% accuracy. Researchers have put enormous effort into coming up with more sophisticated definitions for the “closest neighbor” distance. With a state-of-the-art distance measure, nearest-neighbor classifiers can achieve over 99.5% accuracy on handwritten digits, which is comparable to the performance of much more complex pattern recognition systems, with fancy-sounding names such as “support vector machines” and “convolutional neural networks.” The nearest-neighbor trick is truly a wonder of computer science, combining elegant simplicity with remarkable effectiveness.
It was emphasized earlier that pattern recognition systems work in two phases: a learning (or training) phase in which the training data is processed to extract some characteristics of the classes, and a classification phase in which new, unlabeled data is classified. So, what happened to the learning phase in the nearest-neighbor classifier we've examined so far? It seems as though we take the training data, don't bother to learn anything from it, and jump straight into classification using the nearest-neighbor trick. This happens to be a special property of nearest-neighbor classifiers: they don't require any explicit learning phase. In the next section, we'll look at a different type of classifier in which learning plays a much more important role.
THE TWENTY-QUESTIONS TRICK: DECISION TREES
The game of “twenty questions” holds a special fascination for computer scientists. In this game, one player thinks of an object, and the other players have to guess the identity of the object based only on the answers to no more than twenty yes-no questions. You can even buy small handheld devices that will play twenty questions against you. Although this game is most often used to entertain children, it is surprisingly rewarding to play as an adult. After a few minutes, you start to realize that there are “good questions” and “bad questions.” The good questions are guaranteed to give you a large amount of “information” (whatever that means), while the bad ones are not. For example, it's a bad idea to ask “Is it made of copper?” as your first question, because if the answer is “no,” the range of possibilities has been narrowed very little. These intuitions about good questions and bad questions lie at the heart of a fascinating field called information theory. And they are also central to a simple and powerful pattern recognition technique called decision trees.
A decision tree is basically just a pre-planned game of twenty questions. The figure on the next page shows a trivial example. It's a decision tree for deciding whether or not to take an umbrella with you. You just start at the top of the tree and follow the answers to the questions. When you arrive at one of the boxes at the bottom of the tree, you have the final output.
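Written as code, a tree like this is nothing more than a few nested questions. In the sketch below, the order of the questions is an assumption, since the figure is not reproduced here. The outcomes match the umbrella behavior described later in the chapter: take an umbrella if it is raining, or if it is both humid and cloudy. The function name `take_umbrella` is hypothetical.

```python
def take_umbrella(raining, humid, cloudy):
    """Walk a small hand-written decision tree from the top question downward."""
    if raining:       # first question: is it already raining?
        return True
    if not humid:     # second question: is it humid?
        return False
    return cloudy     # third question: is it cloudy?

print(take_umbrella(raining=False, humid=True, cloudy=True))   # True
print(take_umbrella(raining=False, humid=False, cloudy=True))  # False
```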
You are probably wondering what this has to do with pattern recognition and classification. Well, it turns out that if you are given a sufficient amount of training data, it is possible to learn a decision tree that will produce accurate classifications.
Let's look at an example based on the little-known, but extremely important, problem known as web spam. We already encountered this in chapter 3, where we saw how some unscrupulous website operators try to manipulate the ranking algorithms of search engines by creating an artificially large number of hyperlinks to certain pages. A related strategy used by these devious webmasters is to create web pages that are of no use to humans, but with specially crafted content. You can see a small excerpt from a real web spam page in the figure on the facing page. Notice how the text makes no sense, but repeatedly lists popular search terms related to online learning. This particular piece of web spam is trying to increase the ranking of certain online learning sites that it provides links to.
Decision tree for “Should I take an umbrella?”
Naturally, search engines expend a lot of effort on trying to identify and eliminate web spam. It's a perfect application for pattern recognition: we can acquire a large amount of training data (in this case, web pages), manually label them as “spam” or “not spam,” and train some kind of classifier. That's exactly what some scientists at Microsoft Research did in 2006. They discovered that the best-performing classifier on this particular problem was an old favorite: the decision tree. You can see a small part of the decision tree they came up with on page 92.
Although the full tree relies on many different attributes, the part shown here focuses on the popularity of the words in the page. Web spammers like to include a large number of popular words in order to improve their rankings, so a small percentage of popular words indicates a low likelihood of spam. That explains the first decision in the tree, and the others follow a similar logic. This tree achieves an accuracy of about 90%—far from perfect, but nevertheless an invaluable weapon against web spammers.
human resource management study, web based distance education
Magic language learning online mba certificate and self-directed learning—various law degree online study, on online an education an graduate an degree. Living it consulting and computer training courses. So web development degree for continuing medical education conference, news indiana online education, none college degree online service information systems management program—in computer engineering technology program set online classes and mba new language learning online degrees online nursing continuing education credits, dark distance education graduate hot pc service and support course.
Excerpt from a page of “web spam.” This page contains no information useful to humans—its sole purpose is to manipulate web search rankings. Source: Ntoulas et al. 2006.
The important thing to understand is not the details of the tree itself, but the fact that the entire tree was generated automatically, by a computer program, based on training data from about 17,000 web pages. These “training” pages were classified as spam or not spam by a real person. Good pattern recognition systems can require significant manual effort, but this is a one-time investment that has a many-time payoff.
In contrast to the nearest-neighbor classifier we discussed earlier, the learning phase of a decision tree classifier is substantial. How does this learning phase work? The main intuition is the same as planning a good game of twenty questions. The computer tests out a huge number of possible first questions to find the one that yields the best possible information. It then divides the training examples into two groups, depending on their answers to the first question, and comes up with a best possible second question for each of those groups. And it keeps on moving down the tree in this way, always determining the best question based on the set of training examples that reach a particular point in the tree. If the set of examples ever becomes “pure” at a particular point (that is, the set contains only spam pages or only non-spam pages), the computer can stop generating new questions and instead output the answer corresponding to the remaining pages.
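A small sketch of this learning procedure follows. Several things in it are assumptions rather than details from the text: every question is a simple yes/no feature, "best possible information" is measured with entropy (one standard choice from information theory), and the six training pages are invented. The names `entropy`, `best_question`, `learn_tree`, and `classify` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits; 0 means the set of labels is pure."""
    counts = Counter(labels)
    return -sum((n / len(labels)) * log2(n / len(labels)) for n in counts.values())

def best_question(rows, labels):
    """Pick the yes/no feature whose split leaves the least weighted entropy."""
    def remaining(feature):
        yes = [lab for row, lab in zip(rows, labels) if row[feature]]
        no = [lab for row, lab in zip(rows, labels) if not row[feature]]
        return sum(len(part) / len(labels) * entropy(part) for part in (yes, no) if part)
    return min(rows[0], key=remaining)

def learn_tree(rows, labels):
    """Return a label for a pure set, otherwise (question, yes_subtree, no_subtree)."""
    if len(set(labels)) == 1:
        return labels[0]
    feature = best_question(rows, labels)
    yes = [(row, lab) for row, lab in zip(rows, labels) if row[feature]]
    no = [(row, lab) for row, lab in zip(rows, labels) if not row[feature]]
    if not yes or not no:  # no question separates these examples: use the majority
        return Counter(labels).most_common(1)[0][0]
    return (feature, learn_tree(*zip(*yes)), learn_tree(*zip(*no)))

def classify(tree, row):
    """Follow the answers down the tree until a label is reached."""
    while isinstance(tree, tuple):
        feature, yes_branch, no_branch = tree
        tree = yes_branch if row[feature] else no_branch
    return tree

# Invented training pages, hand-labeled as spam or not spam.
PAGES = [
    {"many_popular_words": True, "many_links": True},
    {"many_popular_words": True, "many_links": False},
    {"many_popular_words": False, "many_links": True},
    {"many_popular_words": False, "many_links": False},
    {"many_popular_words": True, "many_links": True},
    {"many_popular_words": False, "many_links": True},
]
LABELS = ["spam", "not spam", "not spam", "not spam", "spam", "not spam"]

tree = learn_tree(PAGES, LABELS)
print(tree)
print(classify(tree, {"many_popular_words": True, "many_links": False}))  # not spam
```

On this toy data, the first question chosen is `many_popular_words`, because it leaves one group completely pure. The impure group then gets its own second question, just as described above.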
Part of a decision tree for identifying web spam. The dots indicate parts of the tree that have been omitted for simplicity. Source: Ntoulas et al. 2006.
To summarize, the learning phase of a decision tree classifier can be complex, but it is completely automatic and you only have to do it once. After that, you have the decision tree you need, and the classification phase is incredibly simple: just like a game of twenty questions, you move down the tree following the answers to the questions, until you reach an output box. Typically, only a handful of questions are needed, so the classification phase is extremely efficient. Contrast this with the nearest-neighbor approach, in which no effort was required for the learning phase, but the classification phase required us to compare each item to be classified with every training example (100,000 of them for the handwritten digits task).
In the next section, we encounter neural networks: a pattern recognition technique in which the learning phase is not only significant, but directly inspired by the way humans and other animals learn from their surroundings.
NEURAL NETWORKS
The remarkable abilities of the human brain have fascinated and inspired computer scientists ever since the creation of the first digital computers. One of the earliest discussions of actually simulating a brain using a computer was by Alan Turing, a British scientist who was also a superb mathematician, engineer, and code-breaker. Turing's classic 1950 paper, entitled “Computing Machinery and Intelligence,” is most famous for a philosophical discussion of whether a computer could masquerade as a human. The paper introduced a scientific way of evaluating the similarity between computers and humans, known these days as a “Turing test.” But in a less well-known passage of the same paper, Turing directly analyzed the possibility of modeling a human brain using a computer. He estimated that only a few gigabytes of memory might be sufficient.
A typical biological neuron. Electrical signals flow in the directions shown by the arrows. The output signals are only transmitted if the sum of the input signals is large enough.
Sixty years later, it's generally agreed that Turing significantly underestimated the amount of work required to simulate a human brain. But computer scientists have nevertheless pursued this goal in many different guises. One of the results is the field of artificial neural networks, or neural networks for short.
Biological Neural Networks
To help us understand artificial neural networks, we first need an overview of how real, biological neural networks function. Animal brains consist of cells called neurons, and each neuron is connected to many other neurons. Neurons can send electrical and chemical signals through these connections. Some of the connections are set up to receive signals from other neurons; the remaining connections transmit signals to other neurons (see the figure above).
One simple way of describing these signals is to say that at any given moment a neuron is either “idle” or “firing.” When it's idle, a neuron isn't transmitting any signals; when it's firing, a neuron sends frequent bursts of signals through all of its outgoing connections. How does a neuron decide when to fire? It all depends on the strength of the incoming signals it is receiving. Typically, if the total of all incoming signals is strong enough, the neuron will start firing; otherwise, it will remain idle. Roughly speaking, then, the neuron “adds up” all of the inputs it is receiving and starts firing if the sum is large enough. One important refinement of this description is that there are actually two types of inputs, called excitatory and inhibitory. The strengths of the excitatory inputs are added up just as you would expect, but the inhibitory inputs are instead subtracted from the total—so a strong inhibitory input tends to prevent the neuron from firing.
A Neural Network for the Umbrella Problem
An artificial neural network is a computer model that represents a tiny fraction of a brain, with highly simplified operations. We'll initially discuss a basic version of artificial neural networks, which works well for the umbrella problem considered earlier. After that, we'll use a neural network with more sophisticated features to tackle a problem called the “sunglasses problem.”
Each neuron in our basic model is assigned a number called its threshold. When the model is running, each neuron adds up the inputs it is receiving. If the sum of the inputs is at least as large as the threshold, the neuron fires, and otherwise it remains idle. The figure on the next page shows a neural network for the extremely simple umbrella problem considered earlier. On the left, we have three inputs to the network. You can think of these as being analogous to the sensory inputs in an animal brain. Just as our eyes and ears trigger electrical and chemical signals that are sent to neurons in our brains, the three inputs in the figure send signals to the neurons in the artificial neural network. The three inputs in this network are all excitatory. Each one transmits a signal of strength +1 if its corresponding condition is true. For example, if it is currently cloudy, then the input labeled “cloudy?” sends out an excitatory signal of strength +1; otherwise, it sends nothing, which is equivalent to a signal of strength zero.
If we ignore the inputs and outputs, this network has only two neurons, each with a different threshold. The neuron with inputs for humidity and cloudiness fires only if both of its inputs are active (i.e., its threshold is 2), whereas the other neuron fires if any one of its inputs is active (i.e., its threshold is 1). The effect of this is shown in the bottom of the figure on the previous page, where you can see how the final output can change depending on the inputs.
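The two-neuron network can be written out directly. The sketch below assumes the wiring implied by the description: the threshold-2 neuron listens to "humid?" and "cloudy?", and the threshold-1 neuron listens to "raining?" together with the output of the first neuron. The names `fires` and `umbrella_network` are hypothetical.

```python
def fires(inputs, threshold):
    """A basic neuron: fire (1) if the summed inputs reach the threshold, else idle (0)."""
    return 1 if sum(inputs) >= threshold else 0

def umbrella_network(raining, humid, cloudy):
    """Each input is 0 or 1. Returns 1 for 'take an umbrella' and 0 otherwise."""
    humid_and_cloudy = fires([humid, cloudy], threshold=2)  # needs both inputs active
    return fires([raining, humid_and_cloudy], threshold=1)  # needs any one input active

print(umbrella_network(raining=0, humid=1, cloudy=1))  # 1: humid and cloudy
print(umbrella_network(raining=0, humid=0, cloudy=1))  # 0: cloudy only
```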
Top panel: A neural network for the umbrella problem. Bottom two panels: The umbrella neural network in operation. Neurons, inputs, and outputs that are “firing” are shaded. In the center panel, the inputs state that it is not raining, but it is both humid and cloudy, resulting in a decision to take an umbrella. In the bottom panel, the only active input is “cloudy?,” which feeds through to a decision not to take an umbrella.
Faces to be “recognized” by a neural network. In fact, instead of recognizing faces, we will tackle the simpler problem of determining whether a face is wearing sunglasses. Source: Tom Mitchell, Machine Learning, McGraw-Hill (1998). Used with permission.
At this point, it would be well worth your while to look back at the decision tree for the umbrella problem on page 90. It turns out that the decision tree and the neural network produce exactly the same results when given the same inputs. For this very simple, artificial problem, the decision tree is probably a more appropriate representation. But we will next look at a much more complex problem that demonstrates the true power of neural networks.
A Neural Network for the Sunglasses Problem
As an example of a realistic problem that can be successfully solved using neural networks, we'll be tackling a task called the “sunglasses problem.” The input to this problem is a database of low-resolution photographs of faces. The faces in the database appear in a variety of configurations: some of them look directly at the camera, some look up, some look to the left or right, and some are wearing sunglasses. The figure above shows some examples.
We are deliberately working with low-resolution images here, to make our neural networks easy to describe. Each of these images is, in fact, only 30 pixels wide and 30 pixels high. As we will soon see, however, a neural network can produce surprisingly good results with such coarse inputs.
Neural networks can be used to perform standard face recognition on this face database—that is, to determine the identity of the person in a photograph, regardless of whether the person is looking at the camera or disguised with sunglasses. But here, we will attack an easier problem, which will demonstrate the properties of neural networks more clearly. Our objective will be to decide whether or not a given face is wearing sunglasses.
A neural network for the sunglasses problem.
The figure above shows the basic structure of the network. This figure is schematic, since it doesn't show every neuron or every connection in the actual network used. The most obvious feature is the single output neuron on the right, which produces a 1 if the input image contains sunglasses and a 0 otherwise. In the center of the network, we see three neurons that receive signals directly from the input image and send signals on to the output neuron. The most complicated part of the network is on the left, where we see the connections from the input image to the central neurons. Although all the connections aren't shown, the actual network has a connection from every pixel in the input image to every central neuron. Some quick arithmetic will show you that this leads to a rather large number of connections. Recall that we are using low-resolution images that are 30 pixels wide and 30 pixels high. So even these images, which are tiny by modern standards, contain 30 × 30 = 900 pixels. And there are three central neurons, leading to a total of 3 × 900 = 2700 connections in the left-hand layer of this network.
How was the structure of this network determined? Could the neurons have been connected differently? The answer is yes, there are many different network structures that would give good results for the sunglasses problem. The choice of a network structure is often based on previous experience of what works well. Once again, we see that working with pattern recognition systems requires insight and intuition.
Unfortunately, as we shall soon see, each of the 2700 connections in the network we have chosen needs to be “tuned” in a certain way before the network will operate correctly. How can we possibly handle this complexity, which involves tuning thousands of different connections? The answer will turn out to be that the tuning is done automatically, by learning from training examples.
Adding Weighted Signals
As mentioned earlier, our network for the umbrella problem used a basic version of artificial neural networks. For the sunglasses problem, we'll be adding three significant enhancements.
Enhancement 1: Signals can take any value between 0 and 1 inclusive. This contrasts with the umbrella network, in which the input and output signals were restricted to equal 0 or 1 and could not take any intermediate values. In other words, signal values in our new network can be, for example, 0.0023 or 0.755. To make this concrete, think about our sunglasses example. The brightness of a pixel in an input image corresponds to the signal value sent over that pixel's connections. So a pixel that is perfectly white sends a value of 1, whereas a perfectly black pixel sends a value of 0. The various shades of gray result in corresponding values between 0 and 1.
Enhancement 2: Total input is computed from a weighted sum. In the umbrella network, neurons added up their inputs without altering them in any way. In practice, however, neural networks take into account that every connection can have a different strength. The strength of a connection is represented by a number called the connection's weight. A weight can be any positive or negative number. Large positive weights (e.g., 51.2) represent strong excitatory connections—when a signal passes through a connection like this, its downstream neuron is likely to fire. Large negative weights (e.g., -121.8) represent strong inhibitory connections—a signal on this type of connection will probably cause the downstream neuron to remain idle. Connections with small weights (e.g., 0.03 or -0.0074) have little influence on whether their downstream neurons fire. (In reality, a weight is defined as “large” or “small” only in comparison to other weights, so the numerical examples given here only make sense if we assume they are on connections to the same neuron.) When a neuron computes the total of its inputs, each input signal is multiplied by the weight of its connection before being added to the total. So large weights have more influence than small ones, and it is possible for excitatory and inhibitory signals to cancel each other out.
Enhancement 3: The effect of the threshold is softened. A threshold no longer clamps its neuron's output to be either fully on (i.e., 1) or fully off (i.e., 0); the output can be any value between 0 and 1 inclusive. When the total input is well below the threshold, the output is close to 0, and when the total input is well above the threshold, the output is close to 1. But a total input near the threshold can produce an intermediate output value near 0.5. For example, consider a neuron with threshold 6.2. An input of 122 might produce an output of 0.995, since the input is much greater than the threshold. But an input of 6.1 is close to the threshold and might produce an output of 0.45. This effect occurs at all neurons, including the final output neuron. In our sunglasses application, this means that output values near 1 strongly suggest the presence of sunglasses, and output values near 0 strongly suggest their absence.
Signals are multiplied by a connection weight before being summed.
The figure above demonstrates our new type of artificial neuron with all three enhancements. This neuron receives inputs from three pixels: a bright pixel (signal 0.9), a medium-bright pixel (signal 0.6), and a darker pixel (signal 0.4). The weights of these pixels' connections to the neuron happen to be 10, 0.5, and -3, respectively. The signals are multiplied by the weights and then added up, which produces a total incoming signal for the neuron of 8.1. Because 8.1 is significantly larger than the neuron's threshold of 2.5, the output is very close to 1.
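The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the text does not say which formula softens the threshold, so the logistic curve used here is an assumption, and the names `total_input` and `soft_threshold` are made up for illustration.

```python
import math

def total_input(signals, weights):
    """Weighted sum: each signal is multiplied by its connection weight."""
    return sum(s * w for s, w in zip(signals, weights))

def soft_threshold(total, threshold):
    """Assumed softening: a logistic curve centered on the threshold.

    The result is near 0 well below the threshold, near 1 well above it,
    and exactly 0.5 when the total equals the threshold.
    """
    x = total - threshold
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # this form avoids overflow for very negative x
    return e / (1.0 + e)

signals = [0.9, 0.6, 0.4]    # bright, medium-bright, and darker pixel
weights = [10, 0.5, -3]      # connection weights from the example
total = total_input(signals, weights)            # 9.0 + 0.3 - 1.2
output = soft_threshold(total, threshold=2.5)
print(round(total, 6), round(output, 4))
```

With this particular curve the output for a total of 8.1 comes out above 0.99, in line with the "very close to 1" of the example. A different softening function would give somewhat different numbers.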
Tuning a Neural Network by Learning
Now we are ready to define what it means to tune an artificial neural network. First, every connection (and remember, there could be many thousands of these) must have its weight set to a value that could be positive (excitatory) or negative (inhibitory). Second, every neuron must have its threshold set to an appropriate value. You can think of the weights and thresholds as being small dials on the network, each of which can be turned up and down like a dimmer on an electric light switch.
To set these dials by hand would, of course, be prohibitively time-consuming. Instead, we can use a computer to set the dials during a learning phase. Initially, the dials are set to random values. (This may seem excessively arbitrary, but it is exactly what professionals do in real applications.) Then, the computer is presented with its first training sample. In our application, this would be a picture of a person who may or may not be wearing sunglasses. This sample is run through the network, which produces a single output value between 0 and 1. However, because the sample is a training sample, we know the “target” value that the network should ideally produce. The key trick is to alter the network slightly so that its output is closer to the desired target value. Suppose, for example, that the first training sample happens to contain sunglasses. Then the target value is 1. Therefore, every dial in the entire network is adjusted by a tiny amount, in the direction that will move the network's output value toward the target of 1. If the first training sample did not contain sunglasses, every dial would be moved a tiny amount in the opposite direction, so that the output value moves toward the target 0. You can probably see immediately how this process continues. The network is presented with each training sample in turn, and every dial is adjusted to improve the performance of the network. After running through all of the training samples many times, the network typically reaches a good level of performance and the learning phase is terminated with the dials at the current settings.
The details of how to calculate these tiny adjustments to the dials are actually rather important, but they require some math that is beyond the scope of this book. The tool we need is multivariable calculus, which is typically taught as a mid-level college math course. Yes, math is important! Also, note that the approach described here, which experts call “stochastic gradient descent,” is just one of many accepted methods for training neural networks.
All these methods have the same flavor, so let's concentrate on the big picture: the learning phase for a neural network is rather laborious, involving repeated adjustment of all the weights and thresholds until the network performs well on the training samples. However, all this can be done automatically by a computer, and the result is a network that can be used to classify new samples in a simple and efficient manner.
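To make the dial-turning idea concrete, here is a minimal sketch of the learning phase for a single neuron. Everything in it is an assumption made for illustration: the two-pixel "images," the logistic softening of the threshold, the step size, and the update rule (nudge each dial in proportion to the error). A real network would have many neurons and would need the calculus mentioned above to pass adjustments back through its layers.

```python
import math
import random

def neuron_output(signals, weights, threshold):
    """Weighted sum followed by an assumed logistic softening of the threshold."""
    total = sum(s * w for s, w in zip(signals, weights))
    return 1.0 / (1.0 + math.exp(-(total - threshold)))

def train(samples, epochs=2000, step=0.5, seed=0):
    """Present each training sample in turn and nudge every dial slightly."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]  # random start
    threshold = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for signals, target in samples:
            error = target - neuron_output(signals, weights, threshold)
            # A positive error means the output was too low: raise the weights
            # on active inputs and lower the threshold. A negative error
            # moves every dial the opposite way.
            weights = [w + step * error * s for w, s in zip(weights, signals)]
            threshold -= step * error
    return weights, threshold

# Hypothetical two-pixel training samples: (signals, target output).
samples = [([0.9, 0.1], 1), ([0.8, 0.3], 1), ([0.1, 0.9], 0), ([0.2, 0.7], 0)]
weights, threshold = train(samples)
for signals, target in samples:
    print(target, round(neuron_output(signals, weights, threshold), 3))
```

The samples are visited in a fixed order here for simplicity; practical training code usually shuffles them on every pass.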
Let's see how this works out for the sunglasses application. Once the learning phase has been completed, every one of the several thousand connections from the input image to the central neurons has been assigned a numerical weight. If we concentrate on the connections from all pixels to just one of the neurons (say, the top one), we can visualize these weights in a very convenient way, by transforming them into an image. This visualization of the weights is shown in the figure on the next page, for just one of the central neurons. For this particular visualization, strong excitatory connections (i.e., with large positive weights) are white, and strong inhibitory connections (i.e., with large negative weights) are black. Various shades of gray are used for connections of intermediate strength. Each weight is shown in its corresponding pixel location. Take a careful look at the figure. There is a very obvious swath of strong inhibitory weights in the region where sunglasses would typically appear—in fact, you could almost convince yourself that this image of weights actually contains a picture of some sunglasses. We might call this a “ghost” of sunglasses, since they don't represent any particular sunglasses that exist.
Weights (i.e., strengths) of inputs to one of the central neurons in the sunglasses network.
The appearance of this ghost is rather remarkable when you consider that the weights were not set using any human-provided knowledge about the typical color and location of sunglasses. The only information provided by humans was a set of training images, each with a simple “yes” or “no” to specify whether sunglasses were present. The ghost of sunglasses emerged automatically from the repeated adjustment of the weights in the learning phase.
On the other hand, it's clear that there are plenty of strong weights in other parts of the image, which should—in theory—have no impact on the sunglasses decision. How can we account for these meaningless, apparently random, connections? We have encountered here one of the most important lessons learned by artificial intelligence researchers in the last few decades: it is possible for seemingly intelligent behavior to emerge from seemingly random systems. In a way, this should not be surprising. If we had the ability to go into our own brains and analyze the strength of the connections between the neurons, the vast majority would appear random. And yet, when acting as an ensemble, these ramshackle collections of connection strengths produce our own intelligent behavior!
Results from the sunglasses network. Source: Tom Mitchell, Machine Learning, McGraw-Hill (1998). Used with permission.
Using the Sunglasses Network
Now that we are using a network that can output any value between 0 and 1, you may be wondering how we get a final answer—is the person wearing sunglasses or not? The correct technique here is surprisingly simple: an output above 0.5 is treated as “sunglasses,” while an output below 0.5 yields “no sunglasses.”
To test our sunglasses network, I ran an experiment. The face database contains about 600 images, so I used 400 images for learning the network and then tested the performance of the network on the remaining 200 images. In this experiment, the final accuracy of the sunglasses network turned out to be around 85%. In other words, the network gives a correct answer to the question “is this person wearing sunglasses?” on about 85% of images that it has never seen before. The figure above shows some of the images that were classified correctly and incorrectly. It's always fascinating to examine the instances on which a pattern recognition algorithm fails, and this neural network is no exception. One or two of the incorrectly classified images in the right panel of the figure are genuinely difficult examples that even a human might find ambiguous. However, there is at least one (the top left image in the right panel) that appears, to us humans, to be absolutely obvious—a man staring straight at the camera and clearly wearing sunglasses. Occasional mysterious failures of this type are not at all unusual in pattern recognition tasks.
Of course, state-of-the-art neural networks could achieve much better than 85% correctness on this problem. The focus here has been on using a simple network, in order to understand the main ideas involved.
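The decision rule and the accuracy measurement described in this section can be sketched as follows. The five output values and their labels are invented purely for illustration and are not results from the experiment. The text also does not say what to do with an output of exactly 0.5; treating it as "no sunglasses" is an assumption.

```python
def has_sunglasses(output):
    """Decision rule from the text: an output above 0.5 means "sunglasses"."""
    return output > 0.5

def accuracy(outputs, truths):
    """Fraction of test images whose decision matches the true label."""
    correct = sum(has_sunglasses(o) == t for o, t in zip(outputs, truths))
    return correct / len(truths)

# Hypothetical network outputs for five images the network has never seen,
# together with whether each person really is wearing sunglasses.
outputs = [0.93, 0.12, 0.58, 0.41, 0.77]
truths = [True, False, False, False, True]
print(accuracy(outputs, truths))
```

The important point is that accuracy is measured only on images held back from the learning phase, as in the 400/200 split described above.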
PATTERN RECOGNITION: PAST, PRESENT, AND FUTURE
As mentioned earlier, pattern recognition is one aspect of the larger field of artificial intelligence, or AI. Whereas pattern recognition deals with highly variable input data such as audio, photos, and video, AI includes more diverse tasks, including computer chess, online chat-bots, and humanoid robotics.
AI started off with a bang: at a conference at Dartmouth College in 1956, a group of ten scientists essentially founded the field, popularizing the very phrase “artificial intelligence” for the first time. In the bold words of the funding proposal for the conference, which its organizers sent to the Rockefeller Foundation, their discussions would “proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”
The Dartmouth conference promised much, but the subsequent decades delivered little. The high hopes of researchers, perennially convinced that the key breakthrough to genuinely “intelligent” machines was just over the horizon, were repeatedly dashed as their prototypes continued to produce mechanistic behavior. Even advances in neural networks did little to change this: after various bursts of promising activity, scientists ran up against the same brick wall of mechanistic behavior.
Slowly but surely, however, AI has been chipping away at the collection of thought processes that might be defined as uniquely human. For years, many believed that the intuition and insight of human chess champions would beat any computer program, which must necessarily rely on a deterministic set of rules rather than intuition. Yet this apparent stumbling block for AI was convincingly eradicated in 1997, when IBM's Deep Blue computer beat world champion Garry Kasparov.
Meanwhile, the success stories of AI were gradually creeping into the lives of ordinary people too. Automated telephone systems, servicing customers through speech recognition, became the norm. Computer-controlled opponents in video games began to exhibit human-like strategies, even including personality traits and foibles. Online services such as Amazon and Netflix began to recommend items based on automatically inferred individual preferences, often with surprisingly pleasing results.
Indeed, our very perceptions of these tasks have been fundamentally altered by the progress of artificial intelligence. Consider a task that, in 1990, indisputably required the intelligent input of humans, who would actually be paid for their expertise: planning the itinerary of a multistop plane trip. In 1990, a good human travel agent could make a huge difference in finding a convenient and low-cost itinerary. By 2010, however, this task was performed better by computers than humans. Exactly how computers achieve this would be an interesting story in itself, as they do use several fascinating algorithms for planning itineraries. But even more important is the effect of the systems on our perception of the task. I would argue that by 2010, the task of planning an itinerary was perceived as purely mechanistic by a significant majority of humans—in stark contrast to the perception 20 years earlier.
This gradual transformation of tasks, from apparently intuitive to obviously mechanistic, is continuing. Both AI in general and pattern recognition in particular are slowly extending their reach and improving their performance. The algorithms described in this chapter—nearest-neighbor classifiers, decision trees, and neural networks—can be applied to an immense range of practical problems. These include correcting fat-fingered text entry on cell phone virtual keyboards, helping to diagnose a patient's illness from a complex battery of test results, recognizing car license plates at automated toll booths, and determining which advertisement to display to a particular computer user—to name just a few. Thus, these algorithms are some of the fundamental building blocks of pattern recognition systems. Whether or not you consider them to be truly “intelligent,” you can expect to see a lot more of them in the years ahead.
7
Data Compression: Something for Nothing
—JANE AUSTEN, Emma
We're all familiar with the idea of compressing physical objects: when you try to fit a lot of clothes into a small suitcase, you can squash the clothes so that they are small enough to fit even though they would overflow the suitcase at their normal size. You have compressed the clothes. Later, you can decompress the clothes after they come out of the suitcase and (hopefully) wear them again in their original size and shape.
Remarkably, it's possible to do exactly the same thing with information: computer files and other kinds of data can often be compressed to a smaller size for easy storage or transportation. Later, they are decompressed and used in their original form.
Most people have plenty of disk space on their own computers and don't need to bother about compressing their own files. So it's tempting to think that compression doesn't affect most of us. But this impression is wrong: in fact, compression is used behind the scenes in computer systems quite often. For example, many of the messages sent over the internet are compressed without the user even knowing it, and almost all software is downloaded in compressed form—this means your downloads and file transfers are often several times quicker than they otherwise would be. Even your voice gets compressed when you speak on the phone: telephone companies can achieve a vastly superior utilization of their resources if they compress voice data before transporting it.
Compression is used in more obvious ways, too. The popular ZIP file format employs an ingenious compression algorithm that will be described in this chapter. And you're probably very familiar with the trade-offs involved in compressing digital videos: a high-quality video has a much larger file size than a low-quality version of the same video.
LOSSLESS COMPRESSION: THE ULTIMATE FREE LUNCH
It's important to realize that computers use two very different types of compression: lossless and lossy. Lossless compression is the ultimate free lunch that really does give you something for nothing. A lossless compression algorithm can take a data file, compress it to a fraction of its original size, then later decompress it to exactly the same thing. In contrast, lossy compression leads to slight changes in the original file after decompression takes place. We'll discuss lossy compression later, but let's focus on lossless compression for now. For an example of lossless compression, suppose the original file contained the text of this book. Then the version you get after compressing and decompressing contains exactly the same text—not a single word, space, or punctuation character is different. Before we get too excited about this free lunch, I need to add an important caveat: lossless compression algorithms can't produce dramatic space savings on every file. But a good compression algorithm will produce substantial savings on certain common types of files.
So how can we get our hands on this free lunch? How on earth can you make a piece of data, or information, smaller than its actual “true” size without destroying it, so that everything can be reconstructed perfectly later on? In fact, humans do this all the time without even thinking about it. Consider the example of your weekly calendar. To keep things simple, let's assume you work eight-hour days, five days a week, and that you divide your calendar into one-hour slots. So each of the five days has eight possible slots, for a total of 40 slots per week. Roughly speaking, then, to communicate a week of your calendar to someone else, you have to communicate 40 pieces of information. But if someone calls you up to schedule a meeting for next week, do you describe your availability by listing 40 separate pieces of information? Of course not! Most likely you will say something like “Monday and Tuesday are full, and I'm booked from 1 p.m. to 3 p.m. on Thursday and Friday, but otherwise available.” This is an example of lossless data compression! The person you are talking to can exactly reconstruct your availability in all 40 slots for next week, but you didn't have to list them explicitly.
At this point, you might be thinking that this kind of “compression” is cheating, since it depends on the fact that huge chunks of your schedule were the same. Specifically, all of Monday and Tuesday were booked, so you could describe them very quickly, and the rest of the week was available except for two slots that were also easy to describe. It's true that this was a particularly simple example. Nevertheless, data compression in computers works this way too: the basic idea is to find parts of the data that are identical to each other and use some kind of trick to describe those parts more efficiently.
This is particularly easy when the data contains repetitions. For example, you can probably think of a good way to compress the following data:
AAAAAAAAAAAAAAAAAAAAABCBCBCBCBCBCBCBCBCBCAAAAAADEFDEFDEF
If it's not obvious immediately, think about how you would dictate this data to someone over the phone. Instead of saying “A, A, A, A,…, D, E, F,” I'm sure you would come up with something more along the lines of “21 A's, then 10 BC's, then another 6 A's, then 3 DEF's.” Or to quickly make a note of this data on a piece of paper, you might write something like “21A,10BC,6A,3DEF.” In this case, you have compressed the original data, which happens to contain 56 characters, down to a string of only 16 characters. That's less than one-third of the size - not bad! Computer scientists call this particular trick run-length encoding, because it encodes a “run” of repetitions with the “length” of that run.
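The note-taking scheme above can be turned into a short program. The following is a sketch rather than any standard implementation: the function names, the `max_unit` limit on the length of a repeated unit, and the greedy choice of the longest-covering run are all assumptions, and the code assumes the data contains only letters (no digits or commas).

```python
def rle_encode(data, max_unit=3):
    """Greedy run-length encoder for repeated units of up to max_unit characters."""
    runs = []
    i = 0
    while i < len(data):
        best_unit, best_count = data[i], 1
        for size in range(1, max_unit + 1):
            unit = data[i:i + size]
            if len(unit) < size:
                break  # not enough data left for a unit this long
            count = 1
            while data[i + count * size:i + (count + 1) * size] == unit:
                count += 1
            # Keep the unit whose run covers the most characters;
            # on a tie, the shorter unit found first is kept.
            if count * size > best_count * len(best_unit):
                best_unit, best_count = unit, count
        runs.append(f"{best_count}{best_unit}")
        i += best_count * len(best_unit)
    return ",".join(runs)

def rle_decode(encoded):
    """Expand notation such as "21A,10BC" back into the original text."""
    if not encoded:
        return ""
    pieces = []
    for run in encoded.split(","):
        digits = len(run) - len(run.lstrip("0123456789"))
        pieces.append(run[digits:] * int(run[:digits]))
    return "".join(pieces)

data = "A" * 21 + "BC" * 10 + "A" * 6 + "DEF" * 3
encoded = rle_encode(data)
print(len(data), encoded, len(encoded))
```

On the example data this prints the 56-character length, the note `21A,10BC,6A,3DEF`, and its length of 16, and `rle_decode(encoded)` rebuilds the original exactly.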
Unfortunately, run-length encoding is only useful for compressing very specific types of data. It is used in practice, but mostly only in combination with other compression algorithms. For example, fax machines use run-length encoding in combination with another technique, called Huffman coding, that we will encounter later. The main problem with run-length encoding is that the repetitions in the data have to be adjacent—in other words, there must be no other data between the repeated parts. It's easy to compress ABABAB using run-length encoding (it's just 3AB), but impossible to compress ABXABYAB using the same trick.
You can probably see why fax machines can take advantage of run-length encoding. Faxes are by definition black-and-white documents, which get converted into a large number of dots, with each dot being either black or white. When you read the dots in order (left to right, top to bottom), you encounter long runs of white dots (the background) and short runs of black dots (the foreground text or handwriting). This leads to an efficient use of run-length encoding. But as mentioned above, only certain limited types of data have this feature.
So computer scientists have invented a range of more sophisticated tricks that use the same basic idea (find repetitions and describe them efficiently), but work well even if the repetitions aren't adjacent. Here, we'll look at only two of these tricks: the same-as-earlier trick and the shorter-symbol trick. These two tricks are the only things you need to produce ZIP files, and the ZIP file format is the most popular format for compressed files on personal computers. So once you understand the basic ideas behind these two tricks, you will understand how your own computer uses compression, most of the time.
The Same-as-Earlier Trick
Imagine you have been given the dreary task of dictating the following data over the telephone to someone else:
VJGDNQMYLH-KW-VJGDNQMYLH-ADXSGF-O-
VJGDNQMYLH-ADXSGF-VJGDNQMYLH-EW-ADXSGF
There are 63 characters to be communicated here (we are ignoring the dashes, by the way—they were only inserted to make the data easier to read). Can we do any better than dictating all 63 characters, one at a time? The first step might be to recognize that there is quite a lot of repetition in this data. In fact, most of the “chunks” that are separated by dashes get repeated at least once. So when dictating this data, you can save a lot of effort by saying something like “this part is the same as something I told you earlier.” To be a bit more precise, you will have to say how much earlier and how long the repeated part is—perhaps something like “go back 27 characters, and copy 8 characters from that point.”
Let's see how this strategy works out in practice. The first 12 characters have no repetition, so you have no choice but to dictate them one by one: “V, J, G, D, N, Q, M, Y, L, H, K, W.” But the next 10 characters are the same as some earlier ones, so you could just say “back 12, copy 10.” The next 7 are new, and get dictated one by one: “A, D, X, S, G, F, O.” But the 16 characters after that are one big repeat, so you can say “back 17, copy 16.” The next 10 are repeats from earlier too, and “back 16, copy 10” takes care of them. Following that are 2 characters that aren't repeats, so they are dictated as “E, W.” Finally, the last 6 characters are repeats from earlier and are communicated using “back 18, copy 6.”
Let's try to summarize our compression algorithm. We'll use the abbreviation b for “back” and c for “copy.” So a back-and-copy instruction like “back 18, copy 6” gets abbreviated as b18c6. Then the dictation instructions above can be summarized as
VJGDNQMYLH-KW-b12c10-ADXSGF-O-b17c16-b16c10-EW-b18c6
This string consists of only 44 characters. The original was 63 characters, so we have saved 19, or nearly a third of the length of the original.
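The dictation strategy can be automated with a simple greedy search for the longest earlier match. The sketch below is one possible way to do it, not the method used by any real ZIP program: the name `compress`, the `min_match` cutoff (shorter repeats are cheaper to spell out), and the rule of preferring the most recent match on a tie are all assumptions. It also assumes the data consists of capital letters only, so that `b`, `c`, and digits are never ambiguous.

```python
def compress(data, min_match=3):
    """Replace repeats with back-and-copy instructions of the form b<back>c<copy>."""
    pieces = []
    pos = 0
    while pos < len(data):
        best_len, best_back = 0, 0
        # Try every earlier starting point, most recent first.
        for start in range(pos - 1, -1, -1):
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:  # strict ">" keeps the most recent match on ties
                best_len, best_back = length, pos - start
        if best_len >= min_match:
            pieces.append(f"b{best_back}c{best_len}")
            pos += best_len
        else:
            pieces.append(data[pos])  # no worthwhile repeat: dictate the letter
            pos += 1
    return "".join(pieces)

original = ("VJGDNQMYLH-KW-VJGDNQMYLH-ADXSGF-O-"
            "VJGDNQMYLH-ADXSGF-VJGDNQMYLH-EW-ADXSGF").replace("-", "")
summary = compress(original)
print(len(original), summary, len(summary))
```

With these choices the program arrives at the same instructions as the dictation above (written without the dashes), taking 44 characters instead of 63.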
There is one more interesting twist on this same-as-earlier trick. How would you use the same trick to compress the message FG-FG-FG-FG-FG-FG-FG-FG? (Again, the dashes are not part of the message but are only added for readability.) Well, there are 8 repetitions of FG in the message, so we could dictate the first four individually, and then use a back-and-copy instruction as follows: FG-FG-FG-FG-b8c8. That saves quite a few characters, but we can do even better. It requires a back-and-copy instruction that might at first seem nonsensical: “back 2, copy 14,” or b2c14 in our abbreviated notation. The compressed message is, in fact, FG-b2c14. How can it possibly make sense to copy 14 characters when only 2 are available to be copied? In fact, this causes no problem at all as long as you copy from the message being regenerated, and not the compressed message. Let's do this step by step. After the first two characters have been dictated, we have FG. Then the b2c14 instruction arrives, so we go back 2 characters and start copying. There are only two characters available (FG), so let's copy those: when they are added to what we had already, the result is FG-FG. But now there are two more characters available! So copy those as well, and after adding them to the existing regenerated message you have FG-FG-FG. Again, two more characters available so you can copy two more. And this can continue until you have copied the required number of characters (in this case, 14). To check that you understood this, see if you can work out the uncompressed version of this compressed message: Ab1c250.¹
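The step-by-step regeneration described above is easy to express in code. This is a minimal sketch under stated assumptions: literal characters are capital letters, instructions look like `b2c14`, dashes are ignored, and the name `decompress` is made up for illustration. Copying one character at a time is what makes an instruction like “back 2, copy 14” work, because each copied character immediately becomes available to be copied again.

```python
import re

# Either a back-and-copy instruction (b<back>c<copy>) or one literal capital letter.
TOKEN = re.compile(r"b(\d+)c(\d+)|([A-Z])")

def decompress(compressed):
    """Rebuild the original message from literals and back-and-copy instructions."""
    out = []
    for match in TOKEN.finditer(compressed.replace("-", "")):
        back, count, literal = match.groups()
        if literal is not None:
            out.append(literal)
            continue
        back, count = int(back), int(count)
        if back < 1 or back > len(out):
            raise ValueError("instruction points outside the regenerated message")
        start = len(out) - back
        for k in range(count):
            # Copy from the message being regenerated, one character at a time,
            # so freshly copied characters can themselves be copied.
            out.append(out[start + k])
    return "".join(out)

print(decompress("FG-b2c14"))
print(decompress("VJGDNQMYLH-KW-b12c10-ADXSGF-O-b17c16-b16c10-EW-b18c6"))
```

Running `decompress` on the exercise message is a quick way to check your own answer.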
更短的符号技巧
The Shorter-Symbol Trick
要理解这个我们称之为“短符号技巧”的压缩技巧,我们需要更深入地研究一下计算机存储消息的方式。你可能已经听说过,计算机实际上并不存储像a、b和 c 这样的字母。所有的东西都被存储为数字,然后根据某个固定的表格解释为字母。(这种在字母和数字之间转换的技术在我们第 68 页的校验和讨论中也提到过。)例如,我们可能同意“a”用数字 27 表示,“b”用 28 表示,“c”用 29 表示。那么字符串“abc”在计算机中会存储为“272829”,但在显示在屏幕上或打印在纸上之前可以很容易地转换回“abc”。
To understand the compression trick that we'll be calling the “shorter-symbol trick,” we need to delve a little deeper into how computers store messages. As you may have heard before, computers do not actually store letters like a, b, and c. Everything gets stored as a number, and then gets interpreted as a letter according to some fixed table. (This technique for converting between letters and numbers was also mentioned in our discussion of checksums, on page 68.) For example, we might agree that “a” is represented by the number 27, “b” is 28, and “c” is 29. Then the string “abc” would be stored as “272829” in the computer, but could easily be translated back into “abc” before it is displayed on the screen or printed on a piece of paper.
下一页的表格列出了计算机可能需要存储的 100 个符号,以及每个符号对应的两位数代码。顺便说一句,这组特定的两位数代码在任何实际计算机系统中都没有使用,但实际生活中使用的符号非常相似。主要区别在于计算机不使用人类使用的十进制系统。相反,正如您可能已经知道的,它们使用一种称为二进制的另一种数字系统。但这些细节对我们来说并不重要。短符号压缩技巧适用于十进制和二进制数字系统,因此我们假设计算机使用十进制,以便于解释。
The table on the next page gives a complete list of 100 symbols that a computer might want to store, together with a 2-digit code for each one. By the way, this particular set of 2-digit codes is not used in any real computer system, but the ones used in real life are quite similar. The main difference is that computers don't use the 10-digit decimal system that humans use. Instead, as you may already know, they use a different numeric system called the binary system. But those details are not important for us. The shorter-symbol compression trick works for both decimal and binary number systems, so we will pretend that computers use decimal, just to make the explanation easier to follow.
仔细查看符号表。注意,表中的第一项提供了单词之间空格的数字代码“00”。之后是从A(“01”)到Z(“26”)的大写字母,以及从a(“27”)到z(“52”)的小写字母。接下来是各种标点符号,最后一列包含一些用于书写非英语单词的字符,以á(“80”)开头,以Ù(“99”)结尾。
Take a closer look at the table of symbols. Notice that the first entry in the table provides a numeric code for a space between words, “00.” After that come the capital letters from A (“01”) to Z (“26”) and the lowercase letters from a (“27”) to z (“52”). Various punctuation characters follow, and finally some characters for writing non-English words are included in the last column, starting with á (“80”) and ending with Ù (“99”).
那么,计算机该如何使用这些两位数代码来存储“在那里见你的未婚夫”这句话呢?很简单:只需将每个字符转换成相应的数字代码,然后将它们串在一起即可:
So how would these 2-digit codes be used by a computer to store the phrase “Meet your fiancé there.”? Simple: just translate each character into its numeric code and string them all together:
M  e  e  t     y  o  u  r     f  i  a  n  c  é     t  h  e  r  e  .
13 31 31 46 00 51 41 47 44 00 32 35 27 40 29 82 00 46 34 31 44 31 66
务必认识到,在计算机内部,数字对之间没有分隔。因此,此消息实际上存储为一个由 46 位数字组成的连续字符串:“1331314600514147440032352740298200463431443166”。当然,这会使人类解读起来有些困难,但对计算机来说却毫无问题,因为计算机可以轻松地将数字分成两对,然后再将它们转换成要在屏幕上显示的字符。关键在于,如何分离数字代码没有任何歧义,因为每个代码都恰好使用两位数字。事实上,这正是A表示为“01”而不是“1”的原因,B表示为“02”而不是“2”,依此类推,直到字母I(“09”而不是“9”)。如果我们选择取A = “1”,B = “2”,依此类推,那么就不可能对消息进行无歧义的解读。例如,消息“1123”可以分解为“1 1 23”(转换为 AAW),或者“11 2 3”(KBC),甚至“1 1 2 3”(AABC)。因此,请记住这个重要概念:数字代码和字符之间的转换必须无歧义,即使代码彼此相邻存储且没有分隔。这个问题很快就会再次困扰我们!
It's very important to realize that inside the computer, there is no separation between the pairs of digits. So this message is actually stored as a continuous string of 46 digits: “1331314600514147440032352740298200463431443166.” Of course, this makes it a little harder for a human to interpret, but presents no problem whatsoever for a computer, which can easily separate the digits into pairs before translating them into characters to be displayed on the screen. The key point is that there is no ambiguity in how to separate out the numeric codes, since each code uses exactly two digits. In fact, this is exactly the reason that A is represented as “01” rather than just “1”—and B is “02” not “2,” and so on up to the letter I (“09” not “9”). If we had chosen to take A = “1,” B = “2,” and so on, then it would be impossible to interpret messages unambiguously. For example, the message “1123” could be broken up as “1 1 23” (which translates to AAW), or as “11 2 3” (KBC) or even “1 1 2 3” (AABC). So try to remember this important idea: the translation between numeric codes and characters must be unambiguous, even when the codes are stored next to each other with no separation. This issue will come back to haunt us surprisingly soon!
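Here is a small Python sketch of this fixed-width scheme. The table is only partial: the entries for the space, the capital letters, and the lowercase letters follow the ranges quoted above, and the codes for the period (66) and é (82) are read off the worked example. The rest of the 100-symbol table is not reproduced, and the names `CODES`, `encode`, and `decode` are invented for the illustration.

```python
import string

# Partial code table: space = 00, A-Z = 01-26, a-z = 27-52, as in the text.
CODES = {" ": "00"}
for i, letter in enumerate(string.ascii_uppercase, start=1):
    CODES[letter] = f"{i:02d}"
for i, letter in enumerate(string.ascii_lowercase, start=27):
    CODES[letter] = f"{i:02d}"
CODES["."] = "66"        # taken from the worked example
CODES["\u00e9"] = "82"   # e with acute accent, also from the worked example
SYMBOLS = {code: symbol for symbol, code in CODES.items()}

def encode(text: str) -> str:
    return "".join(CODES[ch] for ch in text)

def decode(digits: str) -> str:
    # Every code is exactly two digits, so cutting the string into pairs
    # can only be done one way.
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "".join(SYMBOLS[pair] for pair in pairs)

stored = encode("Meet your fianc\u00e9 there.")
print(stored, len(stored))
print(decode(stored))
```

Because the decoder never has to guess where one code ends and the next begins, the leading zero in “01” is doing real work: it is what keeps every code the same width.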
计算机可以用来存储符号的数字代码。
Numeric codes that a computer could use for storing symbols.
与此同时,让我们回到短符号技巧。与本书中描述的许多所谓的技术性想法一样,短符号技巧也是人类一直在做的事情,甚至无需思考。其基本思想是,如果你经常使用某个词,那么为它创建一个简写缩写是值得的。每个人都知道“USA”是“United States of America”的缩写——每次我们输入或说出3个字母的代码“USA”而不是它所代表的完整24个字母的短语时,我们都会省去很多力气。但我们不会为每个24个字母的短语都费心去计算3个字母的代码。你知道“The sky is blue in color”的缩写吗?它恰好也是一个24个字母的短语。当然不知道!但为什么呢?“United States of America”和“The sky is blue in color”有什么区别?关键的区别在于其中一个短语比另一个短语使用得更频繁,并且我们可以通过缩写一个经常使用的短语而不是一个很少使用的短语来节省更多的精力。
Meanwhile, let's get back to the shorter-symbol trick. As with many of the supposedly technical ideas described in this book, the shorter-symbol trick is something that humans do all the time without even thinking about it. The basic idea is that if you use something often enough, it's worth having a shorthand abbreviation for it. Everyone knows that “USA” is short for “United States of America”—we all save a lot of effort each time we type or say the 3-letter code “USA” instead of the full 24-letter phrase it stands for. But we don't bother with 3-letter codes for every 24-letter phrase. Do you know an abbreviation for “The sky is blue in color,” which also happens to be a 24-letter phrase? Of course not! But why? What is the difference between “United States of America” and “The sky is blue in color”? The key difference is that one of these phrases is used much more often than the other, and we can save a lot more effort by abbreviating a frequently used phrase instead of one that is rarely used.
让我们尝试将这个想法应用到上一页显示的编码系统中。我们已经知道,对常用的东西使用缩写可以节省最多的精力。嗯,字母“e”和“t”是英语中最常用的字母,所以让我们尝试为每个字母使用更短的代码。目前,“e”是31,“t”是46——所以每个字母都需要两位数字来表示。那么把它们缩减到只有一位数字怎么样?假设“e”现在用个位数8表示,“t”用9表示。这是一个好主意!还记得我们之前是如何对短语“在那里见你的未婚夫”进行编码的吗?总共使用了46位数字。现在我们可以这样做,只使用40位数字:
Let's try to apply this idea to the coding system shown on the previous page. We already know that we can save the most effort by using abbreviations for things that are used frequently. Well, the letters “e” and “t” are the ones used most often in English, so let's try to use a shorter code for each of those letters. At the moment, “e” is 31 and “t” is 46, so it takes two digits to represent each of these letters. How about cutting them down to only one digit? Let's say “e” is now represented by the single digit 8, and “t” is 9. This is a great idea! Remember how we encoded the phrase “Meet your fiancé there.” earlier, using a total of 46 digits. Now we can do it as follows, using only 40 digits:
M  e e t    y  o  u  r     f  i  a  n  c  é     t h  e r  e .
13 8 8 9 00 51 41 47 44 00 32 35 27 40 29 82 00 9 34 8 44 8 66
不幸的是,这个计划存在一个致命的缺陷。记住,计算机不会存储单个字母之间的空格。所以编码实际上看起来并不像“13 8 8 9 00 51…44 8 66”,而是像“138890051…44866”。你能看出问题所在吗?只关注前五位数字,也就是 13889。注意,代码 13 代表“M”,8 代表“e”,9 代表“t”,因此解码数字 13889 的一种方法是将它们拆分为 13-8-8-9,得到单词“Meet”。但 88 代表重音符号“ú”,因此数字 13889 也可以拆分为 13-88-9,代表“Mút”。事实上情况更糟,因为 89 代表略有不同的重音符号“ù”,所以 13889 的另一种可能拆分是 13-8-89,代表“Meù”。完全无法分辨这三种可能的解释哪一种是正确的。
Unfortunately, there is a fatal flaw in this plan. Remember that the computer does not store the spaces between the individual letters. So the encoding doesn't really look like “13 8 8 9 00 51…44 8 66.” Instead it looks like “138890051…44866.” Can you see the problem yet? Concentrate on just the first five digits, which are 13889. Notice that the code 13 represents “M,” 8 represents “e,” and 9 represents “t,” so one way of decoding the digits 13889 is to split them up as 13-8-8-9, giving the word “Meet.” But 88 represents the accented symbol “ú,” so the digits 13889 might also be split up as 13-88-9, which represents “Mút.” In fact the situation is even worse, because 89 represents the slightly different accented symbol “ù,” so another possible split of 13889 is 13-8-89, representing “Meù.” There is absolutely no way to tell which of the three possible interpretations is correct.
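To see the ambiguity concretely, the following Python sketch lists every way of reading a digit string when one-digit and two-digit codes are mixed. The table holds only the five codes needed for this example, and the function name `all_readings` is invented for the illustration.

```python
# Only the codes needed for this example: 13 is "M", 8 and 9 are the proposed
# short codes for "e" and "t", and 88 and 89 are the two accented letters
# named in the text.
CODES = {"13": "M", "8": "e", "9": "t", "88": "\u00fa", "89": "\u00f9"}

def all_readings(digits: str) -> list:
    """Return every way to read `digits` as a sequence of codes."""
    if not digits:
        return [""]
    readings = []
    for length in (1, 2):
        head, tail = digits[:length], digits[length:]
        if len(head) == length and head in CODES:
            readings += [CODES[head] + rest for rest in all_readings(tail)]
    return readings

print(all_readings("13889"))   # three different readings
```

A decoder that finds more than one reading has no basis for choosing among them, which is exactly the flaw in the plan.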
灾难!我们巧妙地计划使用更短的字母“e”和“t”代码,结果却导致编码系统完全失效。好在,还有一个小技巧可以解决这个问题。真正的问题是,每当我们看到数字 8 或 9 时,我们根本无法判断它是一位数代码(例如“e”或“t”),还是以 8 或 9 开头的两位数代码(例如各种重音符号,如“á”和“è”)。为了解决这个问题,我们必须做出一些牺牲:一些代码实际上会变得更长。那些以 8 或 9 开头的模糊两位数代码将变成不以 8 或 9 开头的三位数代码。第 114 页的表格显示了一种实现此目的的特殊方法。一些标点符号也受到了影响,但最终我们得到了一个非常理想的结果:以 7 开头的字符都是三位数代码,以 8 或 9 开头的字符都是一位数代码,而以 0、1、2、3、4、5 或 6 开头的字符则和之前一样,都是两位数代码。因此,现在只有一种方法可以拆分数字 13889(13-8-8-9,代表“Meet”),对于任何其他正确编码的数字序列也是如此。所有歧义都被消除了,我们的原始消息可以像这样编码:
Disaster! Our cunning plan to use shorter codes for the letters “e” and “t” has led to a coding system that doesn't work at all. Fortunately, it can be fixed with one more trick. The real problem is that whenever we see a digit 8 or 9, there is no way to tell if it is part of a one-digit code (for either “e” or “t”), or one of the two-digit codes that starts with 8 or 9 (for the various accented symbols like “á” and “è”). To solve this problem, we have to make a sacrifice: some of our codes will actually get longer. The ambiguous two-digit codes that start with 8 or 9 will become three-digit codes that do not start with 8 or 9. The table on page 114 shows one particular way of achieving this. Some of the punctuation characters were affected too, but we finally have a very nice situation: anything starting with a 7 is a three-digit code, anything starting with an 8 or 9 is a one-digit code, and anything starting with 0, 1, 2, 3, 4, 5 or 6 is the same two-digit code as before. So there is exactly one way to split up the digits 13889 now (13-8-8-9, representing “Meet”), and this is true for any other correctly coded sequence of digits. All ambiguity has been removed, and our original message can be encoded like this:
M  e e t    y  o  u  r     f  i  a  n  c  é      t h  e r  e .
13 8 8 9 00 51 41 47 44 00 32 35 27 40 29 782 00 9 34 8 44 8 66
原始编码使用了 46 位数字,而这个编码只使用了 41 位。这看起来节省的空间不大,但对于较长的消息来说,节省的空间可能非常可观。例如,这本书的文本(仅包含文字,不包括图片)需要近 500 KB 的存储空间——也就是 50 万个字符!但使用刚才介绍的两个技巧进行压缩后,大小减少到只有 160 KB,不到原始大小的三分之一。
The original encoding used 46 digits, and this uses only 41. This might seem like a small saving, but with a longer message the savings can be very significant. For example, the text of this book (that is, just the words, with images excluded) requires nearly 500 kilobytes of storage—that's half a million characters! But when compressed using the two tricks just described, the size is reduced to only 160 kilobytes, or less than one-third of the original.
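The rule that removes the ambiguity fits in a short Python sketch: the first digit of each code tells the decoder how many digits to take. The function name `split_codes` is invented here, and the sketch only splits the digits into codes; turning them back into symbols would need the full table on page 114, which is not reproduced.

```python
def split_codes(digits: str) -> list:
    """Split a digit string into codes using only each code's first digit."""
    codes, i = [], 0
    while i < len(digits):
        first = digits[i]
        if first == "7":
            length = 3      # lengthened codes for uncommon symbols
        elif first in "89":
            length = 1      # the short codes for "e" and "t"
        else:
            length = 2      # unchanged two-digit codes starting with 0-6
        code = digits[i:i + length]
        if len(code) < length:
            raise ValueError("digit string ends in the middle of a code")
        codes.append(code)
        i += length
    return codes

# The 41-digit encoding of the example phrase, written in pieces for readability.
message = "13889" "0051414744" "0032352740" "29782009" "34844866"
print(split_codes("13889"))
print(len(message), "digits,", len(split_codes(message)), "codes")
```

At every position there is exactly one possible code length, so the decoder never has to backtrack or guess.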
摘要:免费午餐从何而来?
Summary: Where Did the Free Lunch Come From?
至此,我们了解了在计算机上创建典型压缩 ZIP 文件背后的所有重要概念。具体过程如下:
At this point, we understand all the important concepts behind the creation of typical compressed ZIP files on a computer. Here's how it happens:
步骤 1.使用与之前相同的技巧对原始未压缩文件进行转换,以便文件中的大多数重复数据被更短的指令替换,以便返回并从其他地方复制数据。
Step 1. The original uncompressed file is transformed using the same-as-earlier trick, so that most of the repeated data in the file is replaced by much shorter instructions to go back and copy the data from somewhere else.
步骤2:检查转换后的文件,查看哪些符号出现频率较高。例如,如果原始文件是用英语编写的,那么计算机可能会发现“e”和“t”是两个最常见的符号。然后,计算机会构建一个类似下一页的表格,其中常用符号被赋予较短的数字代码,不常用符号被赋予较长的数字代码。
Step 2. The transformed file is examined to see which symbols occur frequently. For example, if the original file was written in English, then the computer will probably discover that “e” and “t” are the two most common symbols. The computer then constructs a table like the one on the following page, in which frequently used symbols are given short numeric codes and rarely used symbols are given longer numeric codes.
步骤 3.通过直接转换为步骤 2 中的数字代码,再次转换文件。
Step 3. The file is transformed again by directly translating into the numeric codes from Step 2.
在步骤 2 中计算出的数字代码表也存储在 ZIP 文件中——否则,以后将无法解码(进而解压)ZIP 文件。请注意,不同的未压缩文件会产生不同的数字代码表。实际上,在真正的 ZIP 文件中,原始文件会被拆分成多个块,每个块可以有不同的数字代码表。所有这些都可以高效自动地完成,从而对多种类型的文件实现出色的压缩。
The table of numeric codes, computed in step 2, is also stored in the ZIP file—otherwise it would be impossible to decode (and hence decompress) the ZIP file later. Note that different uncompressed files will result in different tables of numeric codes. In fact, in a real ZIP file, the original file is broken up into chunks and each chunk can have a different numeric code table. All of this can be done efficiently and automatically, achieving excellent compression on many types of files.
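If you would like to watch this happen, Python's standard `zlib` module implements DEFLATE, the compression method most commonly used inside ZIP files, which combines an LZ77-style same-as-earlier step with Huffman-style shorter symbols. The sketch below is only a demonstration under that assumption: it compresses a made-up, highly repetitive message in memory and does not build an actual ZIP archive.

```python
import zlib

# A deliberately repetitive, made-up message so both tricks have plenty to work on.
original = ("FG" * 8 + " Meet your fiance there. ").encode("ascii") * 200

compressed = zlib.compress(original)      # same-as-earlier plus shorter symbols
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print(restored == original)               # lossless: every byte comes back
```

The exact compressed size depends on the library version and settings, so no particular number is promised here; the point is that the round trip returns precisely the original bytes.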
使用较短符号技巧的数字代码。与第 111 页表格相比的更改以粗体显示。两个常见字母的代码已被缩短,但大量不常见符号的代码则被加长。这导致大多数消息的总长度缩短。
Numeric codes using the shorter-symbol trick. Changes to the previous table on page 111 are shown in bold. The codes for two common letters have been shortened, at the expense of lengthening the codes for a larger number of uncommon symbols. This results in a shorter total length for most messages.
有损压缩:不是免费的午餐,但却非常划算
LOSSY COMPRESSION: NOT A FREE LUNCH, BUT A VERY GOOD DEAL
到目前为止,我们一直在讨论称为无损压缩的类型,因为您可以获取压缩文件并重建与开始时完全相同的文件,甚至不需要更改一个字符或一个标点符号。相反,有时使用有损压缩会更有用,它允许您获取压缩文件并重建与原始文件非常相似但不一定完全相同的文件。例如,有损压缩经常用于包含图像或音频数据的文件:只要图片在人眼看来相同,那么计算机上存储该图片的文件是否与相机上存储该图片的文件完全相同并不重要。音频数据也是如此:只要歌曲在人耳听起来相同,那么数字音乐播放器上存储该歌曲的文件是否与光盘上存储该歌曲的文件完全相同并不重要。
So far, we have been talking about the type of compression known as lossless, because you can take a compressed file and reconstruct exactly the same file that you started with, without even one character or one punctuation mark being changed. In contrast, sometimes it is much more useful to use lossy compression, which lets you take a compressed file and reconstruct one that is very similar to the original, but not necessarily exactly the same. For example, lossy compression is used very frequently on files that contain images or audio data: as long as a picture looks the same to the human eye, it doesn't really matter whether the file that stores that picture on your computer is exactly the same as the file that stores it on your camera. And the same is true for audio data: as long as a song sounds the same to the human ear, it doesn't really matter whether the file storing that song on your digital music player is exactly the same as the file that stores that song on a compact disc.
事实上,有损压缩有时被运用得更为极端。我们都见过网络上低质量的视频和图片,画面模糊或音质糟糕。这是因为使用了更激进的有损压缩技术,将视频或图片的文件大小压缩到非常小。这里的意思并非是要使视频在人眼看来与原始视频完全相同,而是至少要使其可识别。通过调整压缩的“有损”程度,网站运营商可以在画质和音质近乎完美的大型高质量文件与存在明显缺陷但传输带宽需求更低的低质量文件之间进行权衡。您可能在数码相机上也做过同样的事情,通常可以选择不同的图像和视频质量设置。如果您选择高质量设置,相机上可存储的图片或视频数量会比选择低质量设置时少。这是因为高质量媒体文件比低质量文件占用更多空间。而这一切都是通过调整压缩的“有损”程度来实现的。在本节中,我们将找出进行此调整的一些技巧。
In fact, sometimes lossy compression is used in a much more extreme way. We have all seen low-quality videos and images on the internet in which the picture is blurry or the sound quality rather bad. This is the result of lossy compression being used in a more aggressive fashion to make the file size of the videos or images very small. The idea here is not that the video looks the same as the original to the human eye, but rather that it is at least recognizable. By tuning just how “lossy” the compression is, website operators can trade off between large, high-quality files that look and sound almost perfect, and low-quality files that have obvious defects but require much less bandwidth to transmit. You may have done the same thing on a digital camera, where you can usually choose different settings for the quality of images and videos. If you choose a high-quality setting, the number of pictures or videos you can store on the camera is smaller than when you choose a lower quality setting. That's because high-quality media files take up more space than low-quality ones. And it's all done by tuning just how “lossy” the compression is. In this section, we will find out a few of the tricks for doing this tuning.
省略技巧
The Leave-It-Out Trick
有损压缩的一个简单实用技巧是舍弃部分数据。让我们来看看这个“舍弃”技巧在黑白照片中是如何运作的。首先,我们需要了解一下黑白照片在计算机中的存储方式。一张图片由大量称为“像素”的小点组成。每个像素只有一种颜色,可以是黑色、白色或介于黑色和白色之间的任何灰色。当然,由于这些像素非常小,我们通常意识不到它们的存在,但如果你仔细观察显示器或电视屏幕,就能看到单个像素。
One simple and useful trick for lossy compression is to simply leave out some of the data. Let's take a look at how this “leave-it-out” trick works in the case of black-and-white pictures. First we need to understand a little about how black-and-white pictures are stored in a computer. A picture consists of a large number of small dots, called “pixels.” Each pixel has exactly one color, which could be black, white, or any shade of gray in between. Of course, we are not generally aware of these pixels because they are so small, but you can see the individual pixels if you look closely enough at a monitor or TV screen.
在计算机中存储的黑白图片中,每个可能的像素颜色都用一个数字表示。在本例中,我们假设数字越大代表颜色越白,100 为最高数字。因此,100 代表白色,0 代表黑色,50 代表中灰色,90 代表浅灰色,依此类推。像素排列成由行和列组成的矩形阵列,每个像素代表图片中某个非常小部分的颜色。行数和列数的总数就是图像的“分辨率”。例如,许多高清电视的分辨率为 1920 x 1080,这意味着有 1920 列像素和 1080 行像素。总像素数为 1920 乘以 1080,即超过 200 万像素!数码相机使用相同的术语。“百万像素”只是百万像素的别称。因此,一台 500 万像素的相机拥有足够多的像素行和列,因此,当你将行数乘以列数时,得到的像素数量将超过 500 万。当一张照片存储在计算机中时,它只是一串数字,每个像素对应一个数字。
In a black-and-white picture stored in a computer, each possible pixel color is represented by a number. For this example, let's assume that higher numbers represent whiter colors, with 100 being the highest. So 100 represents white, 0 represents black, 50 represents a medium shade of gray, 90 represents a light gray, and so on. The pixels are arranged in a rectangular array of rows and columns, with each pixel representing the color at some very small part of the picture. The total number of rows and columns tells you the “resolution” of the image. For example, many high-definition TV sets have a resolution of 1920 by 1080—that means there are 1920 columns of pixels and 1080 rows of pixels. The total number of pixels is found by multiplying 1920 by 1080, which gives over 2 million pixels! Digital cameras use the same terminology. A “megapixel” is just a fancy name for a million pixels. So a 5-megapixel camera has enough rows and columns of pixels so that when you multiply the number of rows by the number of columns, you get more than 5 million. When a picture is stored in a computer, it is just a list of numbers, one for each pixel.
下一页图片左上角显示的是一栋带塔楼的房屋,其分辨率远低于高清电视:只有 320 x 240。尽管如此,像素数量仍然相当大(320 x 240 = 76,800),而以未压缩形式存储这张图片的文件占用超过 230 KB 的存储空间。顺便说一下,1 KB 大约相当于 1000 个字符的文本——大致相当于一段电子邮件的大小。因此,大致来说,左上角的图片以文件形式存储时所需的磁盘空间相当于大约 200 封简短的电子邮件。
The picture of a house with a turret shown at the top left of the figure on the next page has a much lower resolution than a high-definition TV: only 320 by 240. Nevertheless, the number of pixels is still rather large (320 × 240 = 76,800), and the file that stores this picture in uncompressed form uses over 230 kilobytes of storage space. A kilobyte, by the way, is equivalent to about 1000 characters of text, roughly the size of a one-paragraph e-mail, for instance. Very approximately, then, the top-left picture, when stored as a file, requires the same amount of disk space as around 200 short e-mail messages.
我们可以使用以下极其简单的技巧来压缩此文件:忽略或“省去”每隔一行像素和每隔一列像素。省去的技巧真的就这么简单!在这种情况下,它会生成一张分辨率较小的图片,为 160 x 120,如图中原始图片下方所示。此文件的大小只有原始文件的四分之一(约 57 KB)。这是因为像素数量只有原始文件的四分之一——我们将图像的宽度和高度都减少了一半。实际上,图像的尺寸两次缩小了 50%——一次水平,一次垂直——最终大小只有原始大小的 25%。
We can compress this file with the following extremely simple technique: ignore, or “leave out,” every second row of pixels and every second column of pixels. The leave-it-out trick really is that simple! In this case, it results in a picture with a smaller resolution of 160 by 120, shown below the original picture in the figure. The size of this file is only one-quarter of the original (about 57 kilobytes). This is because there are only one-quarter as many pixels—we reduced both the width and the height of the image by one-half. Effectively, the size of the image was reduced by 50% twice—once horizontally and once vertically—resulting in a final size that is only 25% of the original.
我们可以再次使用这个技巧。取新的 160 x 120 图像,每隔一行和一列删除一部分,得到另一张新图像,这次只有 80 x 60——结果显示在图的左下角。图像大小再次缩小了 75%,最终文件大小只有 14 KB。这大约只有原始大小的 6%——压缩效果非常令人印象深刻。
And we can do this trick again. Take the new 160 by 120 image, and leave out every second row and column to get another new image, this time only 80 by 60—the result is shown at the bottom left of the figure. The image size is reduced by 75% again, resulting in a final file size of only 14 kilobytes. That's only about 6% of the original—some very impressive compression.
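The leave-it-out trick is short enough to write down directly. The Python sketch below treats a picture as a list of rows of gray values from 0 to 100; the function name `leave_out` and the synthetic test image are assumptions for the illustration, since the house photograph itself is not available here.

```python
def leave_out(pixels):
    """Keep every second row and every second column of a grid of pixels."""
    return [row[::2] for row in pixels[::2]]

# A made-up image with 240 rows and 320 columns of gray values (0-100),
# standing in for the photograph.
original = [[(r + c) % 101 for c in range(320)] for r in range(240)]
half = leave_out(original)
quarter = leave_out(half)

for name, image in [("original", original), ("half", half), ("quarter", quarter)]:
    rows, cols = len(image), len(image[0])
    print(name, cols, "x", rows, "=", rows * cols, "pixels")
```

Applying the function twice leaves 4,800 of the original 76,800 pixels, which is the 6% figure quoted above.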
使用留空技巧进行压缩。左列显示了原始图像和两个较小的缩小版本。每个缩小图像都是通过省略前一个图像中一半的行和列来计算的。在右列中,我们看到了将缩小图像解压缩到与原始图像相同大小的效果。重建并不完美,重建结果与原始图像之间存在一些明显的差异。
Compression using the leave-it-out trick. The left column shows the original image, and two smaller, reduced versions of this image. Each reduced image is computed by leaving out half of the rows and columns in the previous one. In the right column, we see the effect of decompressing the reduced images to the same size as the original. The reconstruction is not perfect and there are some noticeable differences between the reconstructions and the original.
但请记住,我们使用的是有损压缩,所以这次我们不会得到免费的午餐。午餐虽然便宜,但我们必须为此付出代价。看看当我们将其中一个压缩文件解压回原始大小时会发生什么。由于部分行和列的像素被删除,计算机必须猜测这些缺失像素的颜色应该是什么。最简单的猜测是让任何缺失像素的颜色与其相邻像素的颜色相同。任何相邻像素的选择都可以,但这里显示的示例选择了缺失像素正上方和左侧的像素。
But remember, we are using lossy compression, so we don't get a free lunch this time. The lunch is cheap, but we do have to pay for it. Look at what happens when we take one of the compressed files and decompress it back to the original size. Because some of the rows and columns of pixels were deleted, the computer has to guess what the colors of those missing pixels should be. The simplest possible guess is to give any missing pixel the same color as one of its neighbors. Any choice of neighbor would work fine, but the examples shown here choose the pixel immediately above and to the left of the missing pixel.
此解压方案的结果显示在图的右侧。您可以看到,大部分视觉特征都得到了保留,但质量和细节明显有所损失,尤其是在树木、塔楼屋顶以及房屋山墙上的雕花等复杂区域。此外,尤其是在从 80 x 60 图像解压后的版本中,您可以看到一些相当难看的锯齿状边缘,例如房屋屋顶的对角线上。这些就是我们所说的“压缩伪影”:不仅仅是细节的损失,还包括由特定的有损压缩方法和解压方法引入的明显的新特征。
The result of this decompression scheme is shown in the right-hand column of the figure. You can see that most of the visual features have been retained, but there is some definite loss of quality and detail, especially in complex areas like the tree, the turret's roof, and the fret-work in the gable of the house. Also, especially in the version decompressed from the 80 by 60 image, you can see some rather unpleasant jagged edges, for example, on the diagonal lines of the house's roof. These are what we call “compression artifacts”: not just a loss of detail, but noticeable new features that are introduced by a particular method of lossy compression followed by decompression.
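The guessing step can be sketched just as briefly. This Python fragment assumes the compressor kept rows and columns 0, 2, 4, and so on, so that every missing pixel can borrow the value of the stored pixel above it, to its left, or both; the function name `fill_in` is invented for the illustration.

```python
def fill_in(small):
    """Double the width and height by repeating each stored pixel."""
    big = []
    for row in small:
        wide = [value for value in row for _ in range(2)]  # repeat across
        big.append(wide)
        big.append(list(wide))                              # repeat down
    return big

small = [[0, 100],
         [50, 90]]
for row in fill_in(small):
    print(row)
```

Each stored pixel turns into a 2-by-2 block of identical values, which is where the jagged edges on diagonal lines come from.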
虽然省略技巧有助于理解有损压缩的基本思想,但它很少以本文描述的简单形式使用。计算机确实会“省略”信息以实现有损压缩,但它们在选择省略哪些信息时要谨慎得多。一个常见的例子是 JPEG 图像压缩格式。JPEG 是一种精心设计的图像压缩技术,其性能远优于每隔一行或每列省略一次。请看一下对面页面上的图,并将图像的质量和大小与上一图进行比较。在顶部,我们有一张 JPEG 图像,其大小为 35 KB,但它几乎与原始图像难以区分。通过省略更多信息,但仍然使用 JPEG 格式,我们可以得到中间那张 19 KB 的图像,尽管你可以看到房屋的镂空装饰有些模糊和细节丢失,但它仍然具有出色的质量。但是,如果压缩过度,即使是 JPEG 也会出现压缩伪影:在底部您可以看到压缩到只有 12 千字节的 JPEG 图像,您会注意到天空中有一些块状效果,并且在房屋对角线旁边的天空中有一些令人不快的斑点。
Although it's useful for understanding the basic idea of lossy compression, the leave-it-out trick is rarely used in the simple form described here. Computers do indeed “leave out” information to achieve lossy compression, but they are much more careful about which information they leave out. A common example of this is the JPEG image compression format. JPEG is a carefully designed image compression technique which has far better performance than leaving out every second row and column. Take a look at the figure on the facing page, and compare the quality and size of the images with the previous figure. At the top, we have a JPEG image whose size is 35 kilobytes, and yet it is virtually indistinguishable from the original image. By leaving out more information, but sticking with the JPEG format, we can get down to the 19-kilobyte image in the center, which still has excellent quality although you can see some blurring and loss of detail in the fret-work of the house. But even JPEG suffers from compression artifacts if the compression is too extreme: at the bottom you can see a JPEG image compressed down to only 12 kilobytes, and you'll notice some blocky effects in the sky and some unpleasant blotches in the sky right next to the diagonal line of the house.
虽然 JPEG 省略策略的细节过于专业,无法在此完整描述,但该技术的基本原理相当简单。JPEG 首先将整幅图像分成 8 像素 x 8 像素的小方块。每个方块都单独压缩。注意,如果不进行任何压缩,每个方块将由 8 × 8 = 64 个数字表示。(我们假设图片是黑白的——如果是彩色图像,则有三种不同的颜色,因此数字数量是黑白的三倍,但我们这里不讨论这个细节。)如果方块恰好只有一种颜色,那么整个方块就可以用一个数字表示,计算机可以“省略” 63 个数字。如果方块大部分颜色相同,只有一些非常细微的差别(例如天空中某一区域几乎全是相同的灰色),计算机可以决定用一个数字表示该方块,这样该方块的压缩效果良好,稍后解压缩时也只会出现少量误差。在上一页图片的底部,您实际上可以看到天空中的一些 8×8 块正是以这种方式压缩的,从而形成了颜色均匀的小方块。
Although the details of JPEG's leave-it-out strategy are too technical to be described completely here, the basic flavor of the technique is fairly straightforward. JPEG first divides the whole image into small squares of 8 pixels by 8 pixels. Each of these squares is compressed separately. Note that without any compression, each square would be represented by 8 × 8 = 64 numbers. (We are assuming that the picture is black-and-white—if it is a color image, then there are three different colors and therefore three times as many numbers, but we won't worry about that detail here.) If the square happens to be all one color, the entire square can be represented by a single number, and the computer can “leave out” 63 numbers. If the square is mostly the same color, with a few very slight differences (perhaps a region of sky that is almost all the same shade of gray), the computer can decide to represent the square by a single number anyway, resulting in good compression for that square with only a small amount of error when it gets decompressed later. In the bottom image of the figure on the previous page, you can actually see some of the 8-by-8 blocks in the sky that have been compressed in exactly this way, resulting in small square blocks of uniform color.
使用有损压缩方案时,压缩率越高,质量越低。同一张图片以三种不同的 JPEG 质量级别压缩后显示。顶部是质量最高,所需的存储空间也最大。底部是质量最低,所需的存储空间不到一半,但现在出现了明显的压缩伪影,尤其是在天空和屋顶边缘。
With lossy compression schemes, higher compression produces lower quality. The same image is shown compressed at three different JPEG quality levels. At the top is the highest quality, which also requires the most storage. At the bottom is the lowest quality, which requires less than half the storage, but now there are noticeable compression artifacts—especially in the sky and along the border of the roof.
如果 8×8 的正方形颜色从一种颜色平滑过渡到另一种颜色(例如,左侧深灰色到右侧浅灰色),那么 64 个数字可能会被压缩为两个:一个代表深灰色,一个代表浅灰色。JPEG 算法的工作原理并非完全相同,但它采用了相同的思路:如果 8×8 的正方形足够接近某些已知模式的组合,例如恒定颜色或平滑变化的颜色,那么大部分信息就可以被丢弃,只存储每种模式的级别或数量。
If the 8-by-8 square varies smoothly from one color to another (say, dark gray on the left to light gray on the right), then the 64 numbers might be compressed down to just two: a value for the dark gray and a value for the light gray. The JPEG algorithm does not work exactly like this, but it uses the same ideas: if an 8-by-8 square is close enough to some combination of known patterns like a constant color or a smoothly varying color, then most of the information can be thrown away, and just the level or amount of each pattern is stored.
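The simplest case, a square that is almost all one shade, can be sketched in Python as follows. This is emphatically not the real JPEG algorithm, only a toy version of the "represent the square by a single number" idea; the function names and the `tolerance` parameter are assumptions made for the illustration.

```python
def compress_block(block, tolerance=4):
    """Return one number for a nearly uniform block, else the block itself."""
    values = [v for row in block for v in row]
    if max(values) - min(values) <= tolerance:
        return round(sum(values) / len(values))   # 64 numbers become 1
    return block                                  # too varied: keep all 64

def decompress_block(stored, size=8):
    """Rebuild a block; a single number becomes a square of uniform color."""
    if isinstance(stored, int):
        return [[stored] * size for _ in range(size)]
    return stored

sky = [[70 + (r + c) % 3 for c in range(8)] for r in range(8)]   # nearly flat gray
edge = [[0] * 4 + [100] * 4 for _ in range(8)]                   # sharp black/white edge

print(compress_block(sky))            # a single gray value
print(compress_block(edge) is edge)   # True: this block is left alone
```

Raising `tolerance` plays the role of a lower quality setting: more squares collapse to a single number, and the uniform blocks become easier to spot.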
JPEG 非常适合图片压缩,但音频和音乐文件呢?它们也使用有损压缩,并且使用相同的基本原理:忽略对最终产品影响不大的信息。流行的音乐压缩格式,例如 MP3 和 AAC,通常使用与 JPEG 相同的高级方法。音频被分成多个块,每个块单独压缩。与 JPEG 一样,以可预测方式变化的块可以用几个数字来描述。然而,音频压缩格式也可以利用关于人耳的已知事实。特别是,某些类型的声音对人类听众几乎没有影响,可以通过压缩算法消除,而不会降低输出质量。
JPEG works well for pictures, but how about audio and music files? These are also compressed using lossy compression, and they use the same basic philosophy: leave out information that has little effect on the final product. Popular music compression formats, such as MP3 and AAC, generally use the same high-level approach as JPEG. The audio is divided into chunks, and each chunk is compressed separately. As with JPEG, chunks that vary in a predictable way can be described with only a few numbers. However, audio compression formats can also exploit known facts about the human ear. In particular, certain types of sounds have little or no effect on human listeners and can be eliminated by the compression algorithm without reducing the quality of the output.
压缩算法的起源
THE ORIGINS OF COMPRESSION ALGORITHMS
本章中介绍的相同技巧——ZIP 文件中使用的主要压缩方法之一——被计算机科学家称为 LZ77 算法。该算法由两位以色列计算机科学家 Abraham Lempel 和 Jacob Ziv 发明,并于 1977 年发表。
The same-as-earlier trick described in this chapter—one of the main compression methods used in ZIP files—is known to computer scientists as the LZ77 algorithm. It was invented by two Israeli computer scientists, Abraham Lempel and Jacob Ziv, and published in 1977.
然而,要追溯压缩算法的起源,我们需要追溯三十年前的科学史。我们已经认识了贝尔实验室的科学家克劳德·香农,他于1948年发表的论文开创了信息论领域。香农是我们纠错码故事(第五章)中的两位主要英雄之一,但他和他1948年的论文在压缩算法的兴起中也发挥了重要作用。
To trace the origins of compression algorithms, however, we need to delve three decades further back into scientific history. We have already met Claude Shannon, the Bell Labs scientist who founded the field of information theory with his 1948 paper. Shannon was one of the two main heroes in our story of error-correcting codes (chapter 5), but he and his 1948 paper also figure importantly in the emergence of compression algorithms.
这并非巧合。事实上,纠错码和压缩算法是同一枚硬币的两面。归根结底,冗余的概念在第5章中被反复提及。如果一个文件存在冗余,它的长度就会超过必要的长度。重复第5章中的一个简单例子,该文件可能使用单词“five”而不是数字“5”。这样,像“fivq”这样的错误就很容易被识别和纠正。因此,纠错码可以被看作是向消息或文件添加冗余的一种原则性方法。
This is no coincidence. In fact, error-correcting codes and compression algorithms are two sides of the same coin. It all comes down to the notion of redundancy, which featured quite heavily in chapter 5. If a file has redundancy, it is longer than necessary. To repeat a simple example from chapter 5, the file might use the word “five” instead of the numeral “5.” That way, an error such as “fivq” can be easily recognized and corrected. Thus, error-correcting codes can be viewed as a principled way of adding redundancy to a message or file.
压缩算法则相反:它们从消息或文件中消除冗余。很容易想象,一个压缩算法会注意到文件中频繁出现的单词“five”,并将其替换为一个更短的符号(甚至可能是符号“5”),这恰好逆转了纠错编码的过程。实际上,压缩和纠错并不会像这样相互抵消。相反,好的压缩算法会消除低效的冗余类型,而纠错编码则会添加另一种更高效的冗余类型。因此,通常的做法是先压缩消息,然后再对其进行纠错。
Compression algorithms do the opposite: they remove redundancy from a message or file. It's easy to imagine a compression algorithm that would notice the frequent use of the word “five” in a file and replace this with a shorter symbol (which might even be the symbol “5”), exactly reversing the error-correction encoding process. In practice, compression and error correction do not cancel each other out like this. Instead, good compression algorithms remove inefficient types of redundancy, while error-correction encoding adds a different, more efficient type of redundancy. Thus, it is very common to first compress a message and then add some error correction to it.
让我们回到香农。他1948年发表的开创性论文,在其众多杰出贡献中,描述了一种最早的压缩技术。麻省理工学院的教授罗伯特·法诺(Robert Fano)也大约在同一时期发现了这项技术,这种方法现在被称为香农-法诺编码。事实上,香农-法诺编码是本章前面描述的短符号技巧的一种特殊实现方式。我们很快就会看到,香农-法诺编码很快被另一种算法取代,但这种方法非常有效,至今仍作为ZIP文件格式的可选压缩方法之一而存在。
Let's get back to Shannon. His seminal 1948 paper, among its many extraordinary contributions, included a description of one of the earliest compression techniques. An MIT professor, Robert Fano, had also discovered the technique at about the same time, and the approach is now known as Shannon-Fano coding. In fact, Shannon-Fano coding is a particular way of implementing the shorter-symbol trick described earlier in this chapter. As we shall soon see, Shannon-Fano coding was rapidly superseded by another algorithm, but the method is very effective and survives to this day as one of the optional compression methods in the ZIP file format.
香农和法诺都意识到,尽管他们的方法既实用又高效,但却并非最佳方案:香农已经从数学上证明了一定存在更好的压缩技术,但尚未找到实现方法。与此同时,法诺开始在麻省理工学院教授信息论研究生课程,并将如何实现最佳压缩作为该课程学期论文的选项之一。引人注目的是,他的一名学生解决了这个问题,并提出了一种能够为每个符号实现最佳压缩的方法。这名学生就是大卫·霍夫曼,他的技术(现称为霍夫曼编码)是短符号技巧的另一个例子。霍夫曼编码至今仍是一种基本的压缩算法,广泛应用于通信和数据存储系统。
Both Shannon and Fano were aware that although their approach was both practical and efficient, it was not the best possible: Shannon had proved mathematically that even better compression techniques must exist, but had not yet discovered how to achieve them. Meanwhile, Fano had started teaching a graduate course in information theory at MIT, and he posed the problem of achieving optimal compression as one of the options for a term paper in the course. Remarkably, one of his students solved the problem, producing a method that yields the best possible compression for each individual symbol. The student was David Huffman, and his technique—now known as Huffman coding—is another example of the shorter-symbol trick. Huffman coding remains a fundamental compression algorithm and is widely used in communication and data storage systems.
1. The solution: the letter A repeated 251 times.
8
Databases: The Quest for Consistency
—SHERLOCK HOLMES IN ARTHUR CONAN DOYLE'S The Adventure of the Copper Beeches
Consider the following arcane ritual. A person takes from a desk a specially printed pad of paper (known as a checkbook), writes some numbers on it, and adds a signature with a flourish. The person then tears the top sheet from the pad, puts it in an envelope, and sticks another piece of paper (known as a stamp) on the front of the envelope. Finally, the person carries the envelope outside and down the street, to a large box where the envelope is deposited.
Until the turn of the 21st century, this was the monthly ritual of anyone paying a bill: phone bills, electric bills, credit card bills, and so on. Since then, systems of online bill payment and online banking have evolved. The simplicity and convenience of these systems make the previous paper-based method seem almost ludicrously laborious and inefficient by comparison.
What technologies have enabled this transformation? The most obvious answer is the arrival of the internet, without which online communication of any form would be impossible. Another crucial technology is public key cryptography, which we already discussed in chapter 4. Without public key crypto, sensitive financial information could not be securely transmitted over the internet. There is, however, at least one other technology that is essential for online transactions: the database. As computer users, most of us are blissfully unaware of it, but virtually all of our online transactions are processed using sophisticated database techniques, developed by computer scientists since the 1970s.
Databases address two major issues in transaction processing: efficiency and reliability. Databases provide efficiency through algorithms that permit thousands of customers to simultaneously conduct transactions without leading to any conflicts or inconsistencies. And databases provide reliability through algorithms that allow data to survive intact despite the failure of computer components like disk drives, which would usually lead to severe data loss. Online banking is a canonical example of an application that requires outstanding efficiency (to serve many customers at once without producing any errors or inconsistencies) and essentially perfect reliability. So to focus our discussions, we will often return to the example of online banking.
In this chapter, we will learn about three of the fundamental—and beautiful—ideas behind databases: write-ahead logging, two-phase commit, and relational databases. These ideas have led to the absolute dominance of database technology for storing certain types of important information. As usual, we'll try to focus on the core insight behind each of these ideas, identifying a single trick that makes it work. Write-ahead logging boils down to the “to-do list trick,” which is tackled first. Then we move on to the two-phase commit protocol, described here via the simple but powerful “prepare-then-commit trick.” Finally, we will take a peek into the world of relational databases by learning about the “virtual table trick.”
But before learning any of these tricks, let's try to clear up the mystery of what a database actually is. In fact, even in the technical computer science literature, the word “database” can mean a lot of different things, so it is impossible to give a single, correct definition. But most experts would agree that the key property of databases, the one that distinguishes them from other ways of storing information, is that the information in a database has a predefined structure.
To understand what “structure” means here, let's first look at its opposite—an example of unstructured information:
Rosina is 35, and she's friends with Matt, who is 26. Jingyi is 37 and Sudeep is 31. Matt, Jingyi, and Sudeep are all friends with each other.
This is exactly the type of information that a social networking site, like Facebook or MySpace, would need to store about its members. But, of course, the information would not be stored in this unstructured way. Here's the same information in a structured form:
Computer scientists call this type of structure a table. Each row of the table contains information about a single thing (in this case, a person). Each column of the table contains a particular type of information, such as a person's age or name. A database often consists of many tables, but our initial examples will keep things simple and use only a single table.
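The table itself can be reconstructed from the free-text description above. As a minimal sketch, here it is as a Python list in which each dictionary is one row and each key is one column. The column names (name, age, friends) are assumptions chosen for illustration.

```python
# Each dict is one row of the table; each key is one column.
people = [
    {"name": "Rosina", "age": 35, "friends": ["Matt"]},
    {"name": "Jingyi", "age": 37, "friends": ["Matt", "Sudeep"]},
    {"name": "Matt",   "age": 26, "friends": ["Rosina", "Jingyi", "Sudeep"]},
    {"name": "Sudeep", "age": 31, "friends": ["Matt", "Jingyi"]},
]

# Because every row has the same predefined structure, questions become simple lookups.
ages = {row["name"]: row["age"] for row in people}
friends_of_matt = next(row["friends"] for row in people if row["name"] == "Matt")
```

Answering "how old is Jingyi?" from the unstructured sentence requires reading and interpreting English; here it is just `ages["Jingyi"]`.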
Obviously, it is vastly more efficient for humans and computers alike to manipulate data in the structured form of a table, rather than the unstructured free text in the example above. But databases have much more going for them than mere ease of use.
Our journey into the world of databases begins with a new concept: consistency. As we will soon discover, database practitioners are obsessed with consistency—and with good reason. In simple terms, “consistency” means that the information in the database doesn't contradict itself. If there is a contradiction in the database, we have the worst nightmare of the database administrator: inconsistency. But how could an inconsistency arise in the first place? Well, imagine that the first two rows in the table above were changed slightly, giving:
Can you spot the problem here? According to the first row, Rosina is friends with Jingyi. But according to the second row, Jingyi is not friends with Rosina. This violates the basic notion of friendship, which is that two people are simultaneously friends with each other. Admittedly, this is a rather benign example of inconsistency.
To imagine a more serious case, suppose that the concept of “friendship” is replaced with “marriage.” Then we would end up with A married to B, but B married to C—a situation that is actually illegal in many countries.
Actually, this type of inconsistency is easy to avoid when new data is added to the database. Computers are great at following rules, so it's easy to set up a database to follow the rule “If A is married to B, then B must be married to A.” If someone tries to enter a new row that violates this rule, they will receive an error message and the entry will fail. So ensuring consistency based on simple rules doesn't require any clever trick.
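A rule of this kind is easy to express as a check. The sketch below uses the friendship example rather than marriage, and the function name and the dictionary layout are hypothetical. It lists every pair for which the "if A is friends with B, then B must be friends with A" rule is broken.

```python
def asymmetric_pairs(friends: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return every (a, b) where a lists b as a friend but b does not list a."""
    return sorted(
        (a, b)
        for a, listed in friends.items()
        for b in listed
        if a not in friends.get(b, set())
    )

consistent = {
    "Rosina": {"Matt"},
    "Jingyi": {"Matt", "Sudeep"},
    "Matt": {"Rosina", "Jingyi", "Sudeep"},
    "Sudeep": {"Matt", "Jingyi"},
}
# Rosina's row now lists Jingyi, but Jingyi's row was not changed to match.
broken = {**consistent, "Rosina": {"Matt", "Jingyi"}}
```

A database enforcing the rule would refuse any update for which this check returns a non-empty list.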
But there are other types of inconsistency that require much more ingenious solutions. We'll look at one of these next.
TRANSACTIONS AND THE TO-DO LIST TRICK
Transactions are probably the most important idea in the world of databases. But to understand what they are, and why they are necessary, we need to accept two facts about computers. The first fact is one that you are probably all too familiar with: computer programs crash—and when a program crashes, it forgets everything it was doing. Only information that was explicitly saved to the computer's file system is preserved. The second fact we need to know is rather obscure, but extremely important: computer storage devices, such as hard drives and flash memory sticks, can write only a small amount of data instantaneously—typically about 500 characters. (If you're interested in technical jargon, I'm referring here to the sector size of a hard disk, which is typically 512 bytes. With flash memory, the relevant quantity is the page size, which may also be hundreds or thousands of bytes.) As computer users, we never notice this small size limit for instantaneously storing data on a device, because modern drives can execute hundreds of thousands of these 500-character writes every second. But the fact remains that the disk's contents get changed only a few hundred characters at a time.
What on earth does this have to do with databases? It has an extremely important consequence: typically, the computer can update only a single row of a database at any one time. Unfortunately, the very small and simple example above doesn't really demonstrate this. The entire table above contains fewer than 200 characters, so in this particular case, it would be possible for the computer to update two rows at once. But in general, for a database of any reasonable size, altering two different rows does require two separate disk operations.
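To make the arithmetic concrete, here is a small sketch that counts how many 512-byte sectors must be rewritten to update a given set of rows. The fixed row width of 100 bytes and the function name are assumptions made purely for illustration; real databases lay out rows in more elaborate ways.

```python
SECTOR_SIZE = 512   # bytes; the typical hard-disk sector size mentioned in the text
ROW_SIZE = 100      # bytes; an assumed fixed row width, purely for illustration

def sectors_touched(row_numbers):
    """Return the set of sector indexes that must be rewritten to update the given rows."""
    sectors = set()
    for row in row_numbers:
        first_byte = row * ROW_SIZE
        last_byte = first_byte + ROW_SIZE - 1
        sectors.update(range(first_byte // SECTOR_SIZE, last_byte // SECTOR_SIZE + 1))
    return sectors

tiny_table = sectors_touched([0, 1])       # two adjacent rows in a tiny table
big_table = sectors_touched([0, 5000])     # two distant rows in a large table
```

In the tiny table both rows share one sector, so a single write covers them. In the large table the rows live in different sectors, so two separate disk operations are unavoidable.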
With these background facts established, we can get to the heart of the matter. It turns out that many seemingly simple changes to a database require two or more rows to be altered. And as we now know, altering two different rows cannot be achieved in a single disk operation, so the database update will result in some sequence of two or more disk operations. But the computer can crash at any time. What will happen if the computer crashes between two of these disk operations? The computer can be rebooted, but it will have forgotten about any operations it was planning to perform, so it's possible that some of the necessary changes were never made. In other words, the database might be left in an inconsistent state!
At this stage, the whole problem of inconsistency after a crash might seem rather academic, so we'll look at two examples of this extremely important problem. Let's start with an even simpler database than the one above, say:
This very dull and depressing database lists three lonely people. Now suppose Rosina and Jingyi become friends, and we would like to update the database to reflect this happy event. As you can see, this update will require changes to both the first and second rows of the table—and as we discussed earlier, this will generally require two separate disk operations. Let's suppose that row 1 happens to get updated first. Immediately after that update, and before the computer has had a chance to execute the second disk operation that will update row 2, the database will look like this:
So far, so good. Now the database program just needs to update row 2, and it will be done. But wait: what if the computer crashes before it gets a chance to do that? Then after the computer has restarted, it will have no idea that row 2 still needs to be updated. The database will be left exactly as printed above: Rosina is friends with Jingyi, but Jingyi is not friends with Rosina. This is the dreaded inconsistency.
I already mentioned that database practitioners are obsessed with consistency, but at this point it may not seem like such a big deal. After all, does it really matter if Jingyi is recorded as being a friend in one place and friendless in another place? We could even imagine an automated tool that scans through the database every so often, looking for discrepancies like this and fixing them. In fact, tools like this do exist and can be used in databases where consistency is of secondary importance. You may have even encountered an example of this yourself, because some operating systems, when rebooted after a crash, check the entire file system for inconsistencies.
But there do exist situations in which an inconsistency is genuinely harmful and cannot be corrected by an automated tool. A classic example is the case of transferring money between bank accounts. Here's another simple database:
Suppose Zadie has requested to transfer $200 from her checking account to her savings account. Just as in the previous example, this is going to require two rows to be updated, using a sequence of two separate disk operations. First, Zadie's checking balance will be reduced to $600, then her savings balance will be increased to $500. And if we are unlucky enough to experience a crash between these two operations, the database will look like this:
In other words, this is a complete disaster for Zadie: before the crash, Zadie had a total of $1100 in her two accounts, but now she has only $900. She never withdrew any money—but somehow, $200 has completely vanished! And there is no way to detect this, because the database is perfectly self-consistent after the crash. We have encountered a much more subtle type of inconsistency here: the new database is inconsistent with its state before the crash.
It's worth investigating this important point in more detail. In our first example of inconsistency, we ended up with a database that was self-evidently inconsistent: A friends with B, but B not friends with A. This type of inconsistency can be detected merely by examining the database (although the detection process could be very time-consuming, if the database contains millions—or even billions—of records). In our second example of inconsistency, the database was left in a state that is perfectly plausible, when considered as a snapshot taken at a particular time. There is no rule that states what the balances of the accounts must be, or any relationships between those balances. Nevertheless, we can observe inconsistent behavior if we examine the state of the database over time. Three facts are pertinent here: (i) before initiating her transfer, Zadie had $1100; (ii) after the crash, she had $900; (iii) in the intervening period, she did not withdraw any money. Taken together, these three facts are inconsistent, but the inconsistency cannot be detected by examining the database at a particular point in time.
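The vanishing $200 is easy to reproduce in a few lines. The following sketch simulates the crash with an exception. The starting balances ($800 checking, $300 savings) follow from the figures in the text, while the `Crash` class and `transfer` function are hypothetical names.

```python
class Crash(Exception):
    """Stands in for a power failure or program crash."""

def transfer(accounts, amount, crash_between=False):
    """Move `amount` from checking to savings as two separate row updates."""
    accounts["checking"] -= amount        # first disk operation
    if crash_between:
        raise Crash("failure before the second update")
    accounts["savings"] += amount         # second disk operation

zadie = {"checking": 800, "savings": 300}
total_before = sum(zadie.values())
try:
    transfer(zadie, 200, crash_between=True)
except Crash:
    pass                                  # after "rebooting", nothing remembers the second step
total_after = sum(zadie.values())
```

Looking only at the final contents of `zadie`, nothing appears wrong: both balances are perfectly plausible numbers. The problem shows up only when `total_before` and `total_after` are compared.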
To avoid both types of inconsistency, database researchers came up with the concept of a “transaction”—a set of changes to a database that must all take place if the database is to be left consistent. If some, but not all, of the changes in a transaction are performed, then the database might be left inconsistent. This is a simple but extremely powerful idea. A database programmer can issue a command like “begin transaction,” then make a bunch of interdependent changes to the database, and finish with “end transaction.” The database will guarantee that the programmer's changes will all be accomplished, even if the computer running the database crashes and restarts in the middle of the transaction.
To be absolutely correct, we should be aware that there is another possibility too: it's possible that after a crash and restart, the database will return to the exact state it was in before the transaction began. But if this happens, the programmer will receive a notification that the transaction failed and must be resubmitted—so no harm is done. We'll be discussing this possibility in greater detail later, in the section about “rolling back” transactions. But for now, the crucial point is that the database remains consistent regardless of whether a transaction is completed or rolled back.
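To see what "begin transaction" and "end transaction" look like in practice, here is a sketch using SQLite through Python's standard sqlite3 module. SQLite is simply a convenient example database chosen for this illustration, the table layout is an assumption, and the failure is simulated with an exception rather than a genuine crash.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('checking', 800), ('savings', 300)")

def transfer(conn, amount, fail_midway=False):
    """Debit checking and credit savings inside one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'checking'", (amount,))
        if fail_midway:
            raise RuntimeError("simulated failure between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'savings'", (amount,))
        conn.execute("COMMIT")
        return True
    except RuntimeError:
        conn.execute("ROLLBACK")   # database returns to its state before BEGIN
        return False

def balances(conn):
    """Return the current table contents as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

first_try = transfer(conn, 200, fail_midway=True)
after_failure = balances(conn)
second_try = transfer(conn, 200)
after_success = balances(conn)
```

The failed attempt leaves both balances untouched and reports the failure, so the transfer can be resubmitted. The second attempt changes both rows together.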
From the description so far, it may seem that we are obsessing unnecessarily over the possibility of crashes, which are, after all, very rare on modern operating systems running modern application programs. There are two responses to this. First, the notion of “crash” as it applies here is rather general: it encompasses any incident that might cause the computer to stop functioning and thus lose data. The possibilities include power failure, disk failure, other hardware malfunctions, and bugs in the operating system or application programs. Second, even if these generalized crashes are rather rare, some databases cannot afford to take the risk: banks, insurance companies, and any other organization whose data represents actual money cannot afford inconsistency in their records, under any circumstances.
The simplicity of the solution described above (begin a transaction, perform as many operations as necessary, then end the transaction) might sound too good to be true. In fact, it can be achieved with the relatively simple “to-do list” trick described next.
The To-Do List Trick
Not all of us are lucky enough to be well organized. But whether or not we are well organized ourselves, we've all seen one of the great weapons wielded by highly organized people: the “to-do” list. Perhaps you are not a fan of making lists yourself, but it's hard to argue with their usefulness. If you have 10 errands to get done in one day, then writing them down—preferably in an efficient ordering—makes for a very good start. A to-do list is especially useful if you get distracted (or, shall we say, “crash”?) in the middle of the day. If you forget your remaining errands for any reason, a quick glance at the list will remind you of them.
Database transactions are achieved using a special kind of to-do list. That's why we'll call it the “to-do list” trick, although computer scientists use the term “write-ahead logging” for the same idea. The basic idea is to maintain a log of actions the database is planning to take. The log is stored on a hard drive or some other permanent storage, so information in the log will survive crashes and restarts. Before any of the actions in a given transaction are performed, they are all recorded in the log and thus saved to the disk. If the transaction completes successfully, we can save some space by deleting the transaction's to-do list from the log. So Zadie's money-transfer transaction described above would take place in two main steps. First, the database table is left untouched and we write the transaction's to-do list in the log:
After ensuring the log entries have been saved to some permanent storage such as a disk, we make the planned changes to the table itself:
Assuming the changes have been saved to disk, the log entries can now be deleted.
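The three steps (write the log, change the table, delete the log entries) can be sketched as follows. The log format is an assumption: the "begin" and "end" entries and the savings entry are inferred from the surrounding discussion, and the table is kept in memory for brevity, whereas a real database would also force the table changes to disk before discarding the log.

```python
import json
import os
import tempfile

def write_log(path, entries):
    """Append the transaction's to-do list to the log and force it onto disk."""
    with open(path, "a", encoding="utf-8") as log:
        for entry in entries:
            log.write(json.dumps(entry) + "\n")
        log.flush()
        os.fsync(log.fileno())   # do not continue until the log is on permanent storage

table = {"checking": 800, "savings": 300}
todo = [
    {"action": "begin", "txn": "transfer"},
    {"action": "set", "txn": "transfer", "row": "checking", "old": 800, "new": 600},
    {"action": "set", "txn": "transfer", "row": "savings", "old": 300, "new": 500},
    {"action": "end", "txn": "transfer"},
]

log_path = os.path.join(tempfile.mkdtemp(), "todo.log")
write_log(log_path, todo)                 # step 1: table untouched, log written
for entry in todo:                        # step 2: make the planned changes
    if entry["action"] == "set":
        table[entry["row"]] = entry["new"]
os.remove(log_path)                       # step 3: transaction done, discard its log entries
```

The call to `os.fsync` is the important detail: without it, the operating system might still be holding the log entries in memory when the table changes begin.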
But that was the easy case. What if the computer crashes unexpectedly in the middle of the transaction? As before, let's assume the crash occurs after Zadie's checking account has been debited, but before her savings account is credited. The computer reboots and the database restarts, finding the following information on the hard drive:
Now, the computer can tell that it may have been in the middle of a transaction when it crashed, because the log contains some information. But there are four planned actions listed in the log. How can we tell which ones have already been performed on the database and which ones remain to be done? The answer to this question is delightfully simple: it doesn't matter! The reason is that every entry in a database log is constructed so that it has the same effect whether it is performed once, twice, or any other number of times.
The technical word for this is idempotent, so a computer scientist would say that every action in the log must be idempotent. As an example, take a look at entry number 2, “Change Zadie checking from $800 to $600.” No matter how many times Zadie's balance is set to $600, the final effect will be the same. So if the database is recovering after a crash and it sees this entry in the log, it can safely perform the action without worrying about whether it was already performed before the crash too.
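The difference between an idempotent action and a non-idempotent one is easy to demonstrate. In this small sketch (both function names are hypothetical), "set the balance to $600" is repeated three times, and so is "subtract $200".

```python
def set_balance(table, row, new_value):
    """Idempotent: repeating it leaves the table exactly as after the first time."""
    table[row] = new_value

def subtract(table, row, amount):
    """Not idempotent: every repetition changes the result again."""
    table[row] -= amount

a = {"checking": 800}
for _ in range(3):
    set_balance(a, "checking", 600)

b = {"checking": 800}
for _ in range(3):
    subtract(b, "checking", 200)
```

After three repetitions, `a` still holds the intended $600, while `b` has been debited three times. This is why log entries record the new value itself rather than the size of the change.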
Thus, when recovering after a crash, a database can just replay the logged actions of any complete transactions. And it's easy to deal with incomplete transactions too. Any set of logged actions that doesn't finish with an “end transaction” entry simply gets undone in reverse order, leaving the database as if the transaction had never begun. We'll return to this notion of “rolling back” a transaction in the discussion of replicated databases on page 134.
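A recovery routine along these lines might look like the sketch below. It assumes a hypothetical log format in which each entry is a dictionary carrying a transaction name, and each "set" entry records both the old and the new value. It replays transactions that have an "end" entry and undoes, in reverse order, those that do not.

```python
def recover(table, log):
    """Replay complete transactions; undo incomplete ones in reverse order."""
    pending = {}  # transaction name -> list of its "set" entries seen so far
    for entry in log:
        txn = entry["txn"]
        if entry["action"] == "begin":
            pending[txn] = []
        elif entry["action"] == "set":
            pending[txn].append(entry)
        elif entry["action"] == "end":
            for step in pending.pop(txn):          # complete: replay forward
                table[step["row"]] = step["new"]
    for steps in pending.values():                 # no "end" entry: roll back
        for step in reversed(steps):
            table[step["row"]] = step["old"]
    return table

full_log = [
    {"action": "begin", "txn": "transfer"},
    {"action": "set", "txn": "transfer", "row": "checking", "old": 800, "new": 600},
    {"action": "set", "txn": "transfer", "row": "savings", "old": 300, "new": 500},
    {"action": "end", "txn": "transfer"},
]
table_after_crash = {"checking": 600, "savings": 300}   # crashed between the two updates

completed = recover(dict(table_after_crash), full_log)
rolled_back = recover(dict(table_after_crash), full_log[:-1])
```

With the full log, recovery finishes the transfer. With the "end" entry missing, recovery restores the original balances. Because every step is idempotent, running `recover` a second time changes nothing.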
Atomicity, in the Large and in the Small
There is another way of understanding transactions: from the point of view of the database user, every transaction is atomic. Although physicists have known how to split atoms for many decades, the original meaning of “atomic” came from Greek, where it means “indivisible.” When computer scientists say “atomic,” they are referring to this original meaning. Thus, an atomic transaction cannot be divided into smaller operations: either the whole transaction completes successfully, or the database is left in its original condition, as if the transaction had never been started.
So, the to-do list trick gives us atomic transactions, which in turn guarantee consistency. This is a key ingredient in our canonical example: an efficient and completely reliable database for online banking. We are not there yet, however. Consistency does not, by itself, yield adequate efficiency or reliability. When combined with the locking techniques to be described shortly, the to-do list trick maintains consistency even when thousands of customers are simultaneously accessing the database. This does yield tremendous efficiency, because many customers can be served at once. And the to-do list trick also provides a good measure of reliability, since it prevents inconsistencies. Specifically, the to-do list trick precludes data corruption, but does not eliminate data loss. Our next database trick—the prepare-then-commit trick—will produce significant progress toward the goal of preventing any loss of data.
THE PREPARE-THEN-COMMIT TRICK FOR REPLICATED DATABASES
Our journey through ingenious database techniques continues with an algorithm we'll call the “prepare-then-commit trick.” To motivate this trick, we need to understand two more facts about databases: first, they are often replicated, which means that multiple copies of the database are stored in different places; and second, database transactions must sometimes be canceled, which is also called “rolling back” or “aborting” a transaction. We'll briefly cover these two concepts before moving on to the prepare-then-commit trick.
Replicated Databases
The to-do list trick allows databases to recover from certain types of crashes, by completing or rolling back any transactions that were in progress at the time of the crash. But this assumes all the data that was saved before the crash is still there. What if the computer's hard drive is permanently broken and some or all of the data is lost? This is just one of many ways that a computer can suffer from permanent data loss. Other causes include software bugs (in the database program itself or in the operating system) and hardware failures. Any of these problems can cause a computer to overwrite data that you thought was safely stored on the hard drive, wiping it out and replacing it with garbage. Clearly, the to-do list trick can't help us here.
However, data loss is simply not an option in some circumstances. If your bank loses your account information, you will be extremely upset, and the bank could face serious legal and financial penalties. The same goes for a stockbroking firm that executes an order you've placed, but then loses the details of the sale. Indeed, any company with substantial online sales (eBay and Amazon being prime examples) simply cannot afford to lose or corrupt any customer information. But in a data center with thousands of computers, many components (especially hard drives) fail every day. The data on these components is lost every single day. How can your bank keep your data safe in the face of this onslaught?
The obvious, and widely used, solution is to maintain two or more copies of the database. Each copy of the database is called a replica, and the set of all copies taken together is called a replicated database. Often, the replicas are geographically separated (perhaps in different data centers that are hundreds of miles apart), so that if one of them is wiped out by a natural disaster, another replica is still available.
I once heard a computer company executive describe the experiences of the company's customers after the September 11, 2001 terrorist attacks on the twin towers of New York's World Trade Center. The computer company had five major customers in the twin towers, and all were running geographically replicated databases. Four of the five customers were able to continue their operations essentially uninterrupted on surviving database replicas. The fifth customer, unfortunately, had one replica in each tower and lost both! This customer could only resume operations after restoring its database from off-site archival backups.
Note that a replicated database behaves quite differently to the familiar concept of keeping a “backup” of some data. A backup is a snapshot of some data at a particular time—for manual backups, the snapshot is taken at the time you run your backup program, whereas automated backups often take a snapshot of a system at a particular time on a weekly or daily basis, such as every morning at 2 a.m. In other words, a backup is a complete duplicate of some files, or a database, or anything else for which you need a spare copy.
但根据定义,备份不一定是最新的:如果在备份后进行了某些更改,这些更改不会保存在其他任何地方。相比之下,复制数据库会始终保持数据库所有副本同步。每当数据库中任何条目发生哪怕是最轻微的更改时,所有副本都必须立即进行该更改。
But a backup is, by definition, not necessarily up-to-date: if some changes are made after a backup is performed, those changes are not saved anywhere else. In contrast, a replicated database keeps all copies of the database in sync at all times. Every time the slightest change is made to any entry in the database, all of the replicas must make that change immediately.
显然,复制是防止数据丢失的绝佳方法。但复制也存在风险:它又会引入另一种可能的不一致性。如果一个副本的数据最终与另一个副本不同,我们该怎么办?这样的副本彼此不一致,并且可能难以甚至无法确定哪个副本拥有正确的数据版本。在研究如何回滚事务之后,我们将再次讨论这个问题。
Clearly, replication is an excellent way to guard against lost data. But replication has dangers, too: it introduces yet another type of possible inconsistency. What are we going to do if a replica somehow ends up with data that differs from another replica? Such replicas are inconsistent with each other, and it may be difficult or impossible to determine which replica has the correct version of the data. We will return to this issue after investigating how to roll back transactions.
回滚事务
Rolling Back Transactions
冒着重复的风险,我们来仔细回顾一下事务究竟是什么:它是对数据库进行的一系列更改,这些更改必须全部执行才能保证数据库保持一致。在之前关于事务的讨论中,我们主要关注的是确保即使数据库在事务执行过程中崩溃,事务也能顺利完成。
At the risk of being a little repetitive, let's try to recall exactly what a transaction is: it's a set of changes to a database that must all take place to guarantee the database remains consistent. In the earlier discussion of transactions, we were mostly concerned with making sure that a transaction would complete even if the database crashed in the middle of the transaction.
但有时由于某些原因,事务无法完成。例如,事务可能涉及向数据库添加大量数据,而计算机在事务进行到一半时磁盘空间不足。这种情况非常罕见,但却非常重要。
But it turns out that sometimes it is impossible to complete a transaction for some reason. For example, perhaps the transaction involves adding a large amount of data to the database, and the computer runs out of disk space halfway through the transaction. This is a very rare, but nevertheless important, scenario.
事务无法完成的一个更常见原因与另一个称为锁定的数据库概念有关。在繁忙的数据库中,通常会同时执行多个事务。(想象一下,如果您的银行一次只允许一位客户转账,会发生什么——这个网上银行系统的性能将非常糟糕。)但通常重要的是,在事务期间保持数据库的某些部分处于冻结状态。例如,如果事务A正在更新一条记录,以记录 Rosina 现在是 Jingyi 的朋友,那么如果同时运行的事务B将 Jingyi 从数据库中完全删除,那将是灾难性的。因此,事务A将“锁定”数据库中包含 Jingyi 信息的部分。这意味着数据被冻结,任何其他事务都无法更改它。在大多数数据库中,事务可以锁定单个行或列,或整个表。显然,一次只有一个事务可以锁定数据库的特定部分。一旦事务成功完成,它将“解锁”其锁定的所有数据,此后其他事务可以自由地更改先前冻结的数据。
A much more common reason for failing to complete a transaction relates to another database concept called locking. In a busy database, there are usually many transactions executing at the same time. (Imagine what would happen if your bank permitted only one customer to transfer money at any one time—the performance of this online banking system would be truly appalling.) But it is often important that some part of the database remains frozen during a transaction. For example, if transaction A is updating an entry to record that Rosina is now friends with Jingyi, it would be disastrous if a simultaneously running transaction B deleted Jingyi from the database altogether. Therefore, transaction A will “lock” the part of the database containing Jingyi's information. This means the data is frozen, and no other transaction can change it. In most databases, transactions can lock individual rows or columns, or entire tables. Obviously, only one transaction can lock a particular part of the database at any one time. Once the transaction completes successfully, it “unlocks” all of the data it has locked, and after this point other transactions are free to alter the previously frozen data.
死锁:当两个事务A和B都尝试锁定相同的行(但顺序相反)时,它们就会陷入死锁,并且都无法继续进行。
Deadlock: When two transactions, A and B, both try to lock the same rows—but in the opposite order—they become deadlocked, and neither can proceed.
乍一看,这似乎是一个绝佳的解决方案,但它可能导致一种非常糟糕的情况,计算机科学家称之为“死锁”,如上图所示。假设两个长事务A和B同时运行。最初,如图的顶部面板所示,数据库中没有任何行被锁定。之后,如中间面板所示,A锁定了包含 Marie 信息的行,B锁定了包含 Pedro 信息的行。一段时间之后,A发现它需要锁定 Pedro 的行,而B发现它需要锁定 Marie 的行;这种情况在图的底部面板中表示。请注意,A现在需要锁定 Pedro 的行,但一次只能有一个事务锁定任何行,而B已经锁定了 Pedro 的行!因此A需要等到B完成。但是B必须锁定 Marie 的行才能完成,而该行当前被A锁定。因此B需要等到A完成。A和B处于死锁状态,因为彼此都必须等待对方继续进行。它们将永远陷入僵局,并且这些事务永远不会完成。
At first this seems like an excellent solution, but it can lead to a very nasty situation that computer scientists call a deadlock, as demonstrated in the figure above. Let's suppose that two long transactions, A and B, are running simultaneously. Initially, as in the top panel of the figure, none of the rows in the database are locked. Later, as shown in the middle panel, A locks the row containing Marie's information, and B locks the row containing Pedro's information. Some time after this, A discovers that it needs to lock Pedro's row, and B discovers that it needs to lock Marie's row; this situation is represented in the bottom panel of the figure. Note that A now needs to lock Pedro's row, but only one transaction can lock any row at one time, and B has already locked Pedro's row! So A will need to wait until B finishes. But B can't finish until it locks Marie's row, which is currently locked by A. So B will need to wait until A finishes. A and B are deadlocked, because each must wait for the other to proceed. They will be stuck forever, and these transactions will never complete.
计算机科学家对死锁进行了深入研究,许多数据库会定期运行一个特殊的死锁检测程序。当发现死锁时,其中一个死锁事务会被取消,以便另一个事务可以继续进行。但请注意,就像在事务执行过程中磁盘空间耗尽一样,这需要能够中止或“回滚”已部分完成的事务。因此,我们现在知道至少两个事务可能需要回滚的原因。还有很多其他原因,但我们无需赘述。关键在于,事务经常会因为不可预测的原因而无法完成。
Computer scientists have studied deadlocks in great detail, and many databases periodically run a special procedure for deadlock detection. When a deadlock is found, one of the deadlocked transactions is simply canceled, so that the other can proceed. But note that, just as when we run out of disk space in the middle of a transaction, this requires the ability to abort or “roll back” the transaction that has already been partially completed. So we now know of at least two reasons why a transaction might need to be rolled back. There are many others, but there's no need for us to go into details. The fundamental point is that transactions frequently fail to complete for unpredictable reasons.
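The deadlock-detection procedure mentioned above can be pictured as a search for a cycle of waiting transactions. The following is only a minimal sketch, not how any particular database does it: the names `find_deadlock`, `holder`, and `wants` are invented for illustration, each transaction is assumed to wait on at most one other, and the choice of which transaction to cancel is arbitrary.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a waiting cycle, or None if there is none.

    waits_for maps a transaction to the single transaction it is waiting on
    (a simplification; real systems let a transaction wait on several others).
    """
    for start in waits_for:
        seen = []
        current = start
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # We came back to a transaction already visited: that is a cycle.
            return seen[seen.index(current):]
    return None


# Bottom panel of the figure: who holds each row, and which row each transaction wants.
holder = {"Marie": "A", "Pedro": "B"}
wants = {"A": "Pedro", "B": "Marie"}

waits_for = {txn: holder[row] for txn, row in wants.items()}
cycle = find_deadlock(waits_for)
print(cycle)  # ['A', 'B']

# Cancel one transaction in the cycle (here, arbitrarily, the last one found).
victim = cycle[-1]
holder = {row: txn for row, txn in holder.items() if txn != victim}  # release its locks
del wants[victim]
waits_for = {txn: holder[row] for txn, row in wants.items() if row in holder}
print(find_deadlock(waits_for))  # None
```

Once B is canceled and its lock on Pedro's row is released, A no longer waits on anyone and can proceed. Canceling B safely is exactly what requires the rollback ability discussed next.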
只需对待办事项列表技巧稍加调整即可实现回滚:预写日志必须包含足够的附加信息,以便在必要时撤消每个操作。 (这与之前的描述形成了对比,在之前的描述中,我们强调每个日志条目都包含足够的信息,以便在崩溃后重做操作。)这在实践中很容易实现。事实上,在我们检查的简单示例中,撤消信息和重做信息是相同的。像“将 Zadie 的支票账户余额从 800 美元改为 600 美元”这样的条目可以轻松“撤消”——只需将 Zadie 的支票账户余额从 600 美元改为 800 美元即可。总结一下:如果需要回滚某个事务,数据库程序只需通过预写日志(即待办事项列表)向后操作,撤销该事务中的每个操作。
Rollback can be achieved using a slight tweak to the to-do list trick: the write-ahead log must contain enough additional information to undo each operation if necessary. (This contrasts with the earlier description, in which we emphasized that each log entry contains enough information to redo the operation after a crash.) This is easy to achieve in practice. In fact, in the simple examples we examined, the undo information and redo information are identical. An entry like “Change Zadie checking from $800 to $600” can easily be “undone”—by simply changing Zadie's checking balance from $600 to $800. To summarize: if a transaction needs to be rolled back, the database program can just work backward through the write-ahead log (i.e., the to-do list), reversing each operation in that transaction.
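As a rough illustration of working backward through the log, the sketch below records both the old and the new value in each log entry, then undoes a transaction by restoring old values in reverse order. The function names, the transaction ID "T1", and the savings account with its balances are assumptions made for this example; only the change to Zadie's checking balance from $800 to $600 comes from the text.

```python
balances = {"Zadie checking": 800, "Zadie savings": 300}  # savings figure is hypothetical
log = []  # write-ahead log entries: (transaction id, account, old value, new value)


def change(txn, account, new_value):
    """Record the change in the log first, then apply it to the data."""
    log.append((txn, account, balances[account], new_value))
    balances[account] = new_value


def roll_back(txn):
    """Walk the log backward, restoring the old value for each entry of txn."""
    for entry_txn, account, old_value, _new_value in reversed(log):
        if entry_txn == txn:
            balances[account] = old_value


change("T1", "Zadie checking", 600)
change("T1", "Zadie savings", 500)
roll_back("T1")
print(balances)  # {'Zadie checking': 800, 'Zadie savings': 300}
```

The same entries would serve for redoing the transaction after a crash: replaying them forward and applying the new values instead of the old ones.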
准备后再承诺的诀窍
The Prepare-Then-Commit Trick
现在让我们思考一下在复制数据库中回滚事务的问题。这里最大的问题是,其中一个副本可能遇到需要回滚的问题,而其他副本则不会。例如,很容易想象一个副本耗尽了磁盘空间,而其他副本仍然有可用空间。
Now let's think about the problem of rolling back transactions in a replicated database. The big issue here is that one of the replicas might encounter a problem that requires rollback, while the others do not. For example, it's easy to imagine that one replica runs out of disk space while the others still have space available.
一个简单的类比就很有帮助。假设你和三个朋友都想一起看一部最近上映的电影。为了让故事更有趣,让我们把这个故事设定在 20 世纪 80 年代,电子邮件出现之前,所以这次电影之旅必须通过电话安排。你会怎么做呢?一种可能的方法如下。决定一个适合你并且——据你所知——也很可能适合你的朋友的电影日期和时间。假设你选择星期二晚上 8 点。下一步是打电话给你的一个朋友,问他或她星期二 8 点是否有空。如果答案是肯定的,你会说“太好了,请把时间写下来,我稍后会给你回电话确认。”然后你会打电话给下一个朋友,做同样的事情。最后,你打电话给第三个也是最后一个朋友,提出同样的邀请。如果每个人星期二 8 点都有空,你就最终决定确认活动并给你的朋友回电话告诉他们。
A simple analogy will help here. Suppose that you and three friends would all like to see a recently released movie together. To make things interesting, let's set this story in the 1980s, before the days of e-mail, so the movie trip is going to have to be organized by telephone. How do you go about it? One possible approach is as follows. Decide on a day and time for the movie that work for you, and—as far as you know—are likely to be suitable for your friends too. Let's suppose you choose Tuesday at 8 p.m. The next step is to call one of your friends and ask if he or she is free on Tuesday at 8. If the answer is yes, you'll say something like “great, please pencil that in, and I'll call you back later to confirm.” Then you'll call the next friend and do the same thing. Finally, you call the third and final friend with the same offer. If everyone is available on Tuesday at 8, you make the final decision to confirm the event and call back your friends to let them know.
以上只是一个简单的例子。如果其中一个朋友周二 8 点没空怎么办?在这种情况下,你需要“回滚”之前完成的所有工作,然后重新开始。实际上,你可能会给每个朋友打电话,立即提议新的日期和时间。但为了尽可能简化流程,我们假设你给每个朋友打电话说:“抱歉,周二 8 点不合适,请从日历上删除这个时间,我很快会给你回复新的提议。” 完成后,你就可以重新开始整个流程了。
That was the easy case. What happens if one of the friends is not available on Tuesday at 8? In this case, you will need to “roll back” all of the work done so far and start again. In reality, you would probably call each friend and immediately propose a new day and time, but to keep things as simple as possible here, let's instead assume that you call each friend and say “Sorry, Tuesday at 8 is no good, please erase that from your calendar, and I'll get back to you soon with a new proposal.” Once this is done, you can start the whole procedure all over again.
请注意,组织电影郊游的策略分为两个不同的阶段。在第一阶段,日期和时间已经提出,但尚未 100% 确定。一旦你发现该提议对每个人都可行,你就知道日期和时间现在 100% 确定了,但其他人还不知道。因此,在第二阶段,你要召回所有朋友进行确认。或者,如果一个或多个朋友无法参加,则第二阶段包括召回所有人取消。计算机科学家将此称为两阶段提交协议;我们称之为“准备然后提交技巧”。第一阶段称为“准备”阶段。第二阶段是“提交”阶段或“中止”阶段,具体取决于初始提议是否已被所有人接受。
Notice that there are two distinct phases in your strategy for organizing the movie outing. In the first phase, the date and time have been proposed but are not yet 100% certain. Once you find out that the proposal is feasible for everyone, you know that the date and time are now 100% certain, but everyone else does not. Therefore, there is a second phase in which you call back all of your friends to confirm. Alternatively, if one or more friends were unable to make it, the second phase consists of calling back everyone to cancel. Computer scientists call this the two-phase commit protocol; we'll call it the “prepare-then-commit trick.” The first phase is called the “prepare” phase. The second phase is either a “commit” phase or an “abort” phase, depending on whether the initial proposal has been accepted by everyone or not.
有趣的是,这个类比中涉及到数据库锁定的概念。虽然我们没有明确讨论,但你的每个朋友在安排电影观看时都做出了一个隐含的承诺:他们承诺周二8点不再安排其他活动。在他们收到你的确认或取消回复之前,日历上的这个时间段将被“锁定”,无法被任何其他“事务”更改。例如,如果在第一阶段之后、第二阶段之前的某个时间,有人打电话给你的朋友,提议周二8点去看一场篮球比赛,会发生什么?你的朋友应该这样说:“抱歉,我那个时候可能还有其他约会。在那个约会最终敲定之前,我无法给你关于篮球比赛的确切消息。”
Interestingly, there is a notion of database locking in this analogy. Although we didn't explicitly discuss it, each of your friends makes an implicit promise when they pencil in the movie outing: they are promising not to schedule something else for Tuesday at 8. Until they hear back from you with a confirmation or cancellation, that slot on the calendar is “locked” and cannot be altered by any other “transaction.” For example, what should happen if someone else calls your friend, sometime after the first phase but before the second phase, to propose watching a basketball game on Tuesday at 8? Your friend should say something like “Sorry, but I might have another appointment at that time. Until that appointment is finalized, I can't give you a firm answer about the basketball game.”
现在让我们研究一下“准备然后提交”技巧对于复制数据库是如何运作的。下一页的图演示了这个想法。通常,其中一个副本是协调事务的“主服务器”。具体来说,假设有三个副本,A、 B 和C,其中A是主服务器。假设数据库需要执行一个事务,将新数据插入表中。准备阶段从A锁定该表开始,然后将新数据写入其预写日志。同时,A将新数据发送给B和C,它们也锁定自己的表副本并将新数据写入其日志中。然后, B和C向A报告它们是否成功执行了此操作。现在开始第二阶段。如果 A、B 或C中的任何一个遇到问题(例如磁盘空间不足或无法锁定表),主服务器A就知道必须回滚事务,并将此情况通知所有副本 - 参见第 140 页的图。但是,如果所有副本都报告准备阶段成功,则A会向每个副本发送一条消息以确认事务,然后副本会完成该事务(如下页的图所示)。
Let's now examine how the prepare-then-commit trick works for a replicated database. The figure on the next page demonstrates the idea. Typically, one of the replicas is the “master” that coordinates the transaction. To be specific, suppose there are three replicas, A, B, and C, with A being the master. Suppose the database needs to execute a transaction that inserts a new row of data into a table. The prepare phase begins with A locking that table, then writing the new data into its write-ahead log. At the same time, A sends the new data to B and C, which also lock their own copies of the table and write the new data in their logs. B and C then report back to A on whether they succeeded or failed in doing this. Now the second phase begins. If any of A, B, or C encountered a problem (such as running out of disk space or failing to lock the table), the master A knows that the transaction must be rolled back and informs all replicas of this—see the figure on page 140. But if all the replicas reported success from their prepare stages, A sends a message to each replica confirming the transaction, and the replicas then complete it (as in the figure on the next page).
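The two phases described above can be sketched in a few lines. This is a toy model under stated assumptions rather than a real replication protocol: the `Replica` class, its `free_space` counter (standing in for disk space), and `replicated_insert` are hypothetical names, messages between machines are replaced by ordinary method calls, and only one transaction runs at a time.

```python
class Replica:
    def __init__(self, name, free_space):
        self.name = name
        self.free_space = free_space  # hypothetical: number of rows that still fit on disk
        self.table = []               # committed rows
        self.pending = None           # row written to the log but not yet committed
        self.locked = False

    def prepare(self, row):
        """Phase 1: lock the table and log the new row; report success or failure."""
        if self.locked or self.free_space < 1:
            return False
        self.locked = True
        self.pending = row
        return True

    def commit(self):
        """Phase 2 (commit): make the logged row permanent and unlock."""
        self.table.append(self.pending)
        self.free_space -= 1
        self.pending = None
        self.locked = False

    def abort(self):
        """Phase 2 (abort): roll back the logged row and unlock."""
        self.pending = None
        self.locked = False


def replicated_insert(replicas, row):
    """Run prepare on every replica, then commit everywhere or abort everywhere."""
    votes = [replica.prepare(row) for replica in replicas]
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    for replica, prepared in zip(replicas, votes):
        if prepared:  # replicas that never prepared have nothing to undo
            replica.abort()
    return False


a, b, c = Replica("A", 5), Replica("B", 5), Replica("C", 1)
print(replicated_insert([a, b, c], "row 1"))  # True
print(replicated_insert([a, b, c], "row 2"))  # False: C has run out of space
print(a.table, b.table, c.table)              # ['row 1'] ['row 1'] ['row 1']
```

The second insert fails only on C, yet A and B roll back as well, so all three replicas end up with identical tables.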
到目前为止,我们已经掌握了两种数据库技巧:待办事项列表技巧和“准备后提交”技巧。它们能给我们带来什么?通过结合这两种技巧,您的银行(以及任何其他在线实体)可以实现具有原子事务的复制数据库。这使得我们能够同时高效地为数千名客户提供服务,并且几乎不会出现任何不一致或数据丢失的情况。然而,我们还没有深入数据库的核心:数据是如何构建的?查询是如何响应的?我们最后的数据库技巧将为这些问题提供一些答案。
So far we have two database tricks at our disposal: the to-do list trick and the prepare-then-commit trick. What do they buy us? By combining the two tricks, your bank—and any other online entity—can implement a replicated database with atomic transactions. And this permits simultaneous, efficient service to thousands of customers, with essentially zero chance of any inconsistency or data loss. However, we have not yet looked into the heart of the database: how is the data structured, and how are queries answered? Our final database trick will provide some answers to these questions.
关系数据库和虚拟表技巧
RELATIONAL DATABASES AND THE VIRTUAL TABLE TRICK
到目前为止,我们所有的示例中,数据库都只包含一个表。但现代数据库技术的真正威力,在拥有多个表的数据库中才能得到充分释放。其基本思想是,每个表存储一组不同的信息,但各个表中的实体通常以某种方式相互关联。因此,一家公司的数据库可能包含单独的表,用于存储客户信息、供应商信息和产品信息。但是,客户表可能会提及产品表中的商品,因为客户订购了产品。产品表也可能会提及供应商表中的商品,因为产品是由供应商的货物制造的。
In all of our examples so far, the database has consisted of exactly one table. But the true power of modern database technology is unleashed in databases that have multiple tables. The basic idea is that each table stores a different set of information, but that entities in the various tables are often connected in some way. So a company's database might consist of separate tables for customer information, supplier information, and product information. But the customer table might mention items in the product table, because customers order products. And perhaps the product table will mention items in the supplier table, because products are manufactured from the suppliers' goods.
准备后提交技巧:主副本A协调另外两个副本(B、C)向表中添加一些新数据。在准备阶段,主副本会检查所有副本是否能够完成事务。一旦所有副本都确认无误,主副本就会通知所有副本提交数据。
The prepare-then-commit trick: The master replica, A, coordinates two other replicas (B, C) to add some new data to the table. In the prepare phase, the master checks whether all replicas will be able to complete the transaction. Once it gets the all clear, the master tells all replicas to commit the data.
带回滚的“准备后提交”技巧:此图的上图与上图完全相同。但在准备阶段,其中一个副本遇到了错误。因此,下图是“中止”阶段,每个副本都必须回滚事务。
The prepare-then-commit trick with rollback: The top panel of this figure is exactly the same as in the previous figure. But during the prepare phase, one of the replicas encounters an error. As a result, the bottom panel is an “abort” phase in which each replica must roll back the transaction.
让我们看一个小而真实的例子:一所大学存储的信息,详细说明哪些学生选修了哪些课程。为了便于管理,该示例只包含少量学生和课程,但希望能够清楚地表明,当数据量大得多时,同样的原则也适用。
Let's take a look at a small but real example: the information stored by a college, detailing which students are taking which courses. To keep things manageable, the example will have only a handful of students and courses, but hopefully it will be clear that the same principles apply when the amount of data is much larger.
首先,让我们看一下本章迄今为止一直使用的简单的单表存储方法是如何存储数据的。下一页图表的上图显示了这一方法。如您所见,该数据库有 10 行 5 列;衡量数据库中信息量的一种简单方法是假设数据库中有 10 × 5 = 50 个数据项。现在,请花几秒钟时间仔细研究下一页图表的上图。这种数据存储方式有什么让您感到困扰的地方吗?例如,您能看到任何不必要的数据重复吗?您能想到一种更高效的存储相同信息的方法吗?
First, let's look at how the data might be stored in the simple, one-table approach we've been using so far in this chapter. This is shown in the top panel of the figure on the following page. As you can see, there are 10 rows in this database and 5 columns; a simple way of measuring the amount of information in the database is to say there are 10 × 5 = 50 data items in the database. Spend a few seconds now studying the top panel of the figure on the next page more closely. Is there anything that irks you about the way this data is stored? For instance, can you see any unnecessary repetition of data? Can you think of a more efficient way of storing the same information?
您可能已经意识到,每门课程的很多信息对于每个选修该课程的学生来说都是重复的。例如,有三名学生选修了 ARCH101 课程,这门课程的详细信息(包括课程名称、授课教师和教室号)对于这三名学生来说都是重复的。存储这些信息的更有效方法是使用两个表:一个表用于存储哪些学生选修了哪些课程,另一个表用于存储每门课程的详细信息。这种双表方法如下一页图表的底部面板所示。
You probably realized that a lot of information about each course is duplicated for each student that takes the course. For example, three students take ARCH101, and the detailed information about this course (including its title, instructor, and room number) is repeated for each of the three students. A much more effective way of storing this information is to use two tables: one to store which courses are taken by which students, and another to store the details about each course. This two-table approach is shown in the bottom panel of the figure on the following page.
我们一眼就能看出这种多表方法的一个优点:所需的总存储量减少了。新方法使用一个包含 10 行 2 列(即 10 × 2 = 20 个项目)的表,以及另一个包含 3 行 4 列(即 3 × 4 = 12 个项目)的表,总共包含 32 个项目。相比之下,单表方法需要 50 个项目才能存储完全相同的信息。
We can immediately see one of the advantages of this multitable approach: the total amount of storage required is reduced. This new approach uses one table with 10 rows and 2 columns (i.e., 10 × 2 = 20 items), and a second table with 3 rows and 4 columns (i.e., 3 × 4 = 12 items), resulting in a total of 32 items. In contrast, the one-table approach needed 50 items to store exactly the same information.
这种节省是如何实现的?这来自于重复信息的消除:我们不再为每个学生选修的每门课程重复列出课程名称、教师和房间号,而是对每门课程只列出一次这些信息。不过,为了实现这一点,我们也做出了一些牺牲:现在课程号出现在两个不同的地方,因为两个表中都有“课程号”列。因此,我们用大量的重复(课程详情)换取了少量的重复(课程号)。总的来说,这是一笔划算的交易。在这个小例子中,收益并不大,但您可能会看到,如果每门课程有数百名学生,那么这种方法节省的存储空间将是巨大的。
How did this saving come about? It comes from the elimination of repeated information: instead of repeating the course title, instructor, and room number for each course taken by each student, this information is listed exactly once for each course. We have sacrificed something to achieve this, though: now the course numbers appear in two different places, since there is a “course number” column in both tables. So we have traded a large amount of repetition (of the course details) for a small amount of repetition (of the course numbers). Overall, this works out to be a good deal. The gains in this small example are not huge, but you can probably see that if there are hundreds of students taking each course, the storage savings from this approach would be enormous.
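The item counts above follow a simple pattern that is easy to check for larger cases. The sketch below assumes the column counts from the example (5 columns in the single table; 2 and 4 columns in the two-table design); the function names and the figure of 300 students per course are made up for illustration.

```python
def items_one_table(enrollments, columns=5):
    """Every enrollment row repeats all of the course details."""
    return enrollments * columns


def items_two_tables(enrollments, courses, enrollment_columns=2, course_columns=4):
    """Enrollment rows hold only (student, course number); details are stored once per course."""
    return enrollments * enrollment_columns + courses * course_columns


# The example from the text: 10 enrollments across 3 courses.
print(items_one_table(10), items_two_tables(10, 3))    # 50 32

# A hypothetical larger college: 3 courses with 300 students each.
print(items_one_table(900), items_two_tables(900, 3))  # 4500 1812
```

With more columns of course details, the gap would widen further, since those columns are paid for once per course instead of once per enrollment.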
上图:用于存储学生课程的单表数据库。
下图:相同数据以更高效的双表形式存储。
Top: Single-table database for students' courses.
Bottom: The same data stored more efficiently, in two tables.
多表方法还有另一个巨大的优势。如果表设计正确,那么对数据库的更改会更加容易。例如,假设 MATH314 的房间号从 560 更改为 440。在单表方法中(上一页图的顶部面板),需要更新四行数据,而且正如我们之前所讨论的,这四行更新需要包含在一个事务中,以确保数据库保持一致。但在多表方法中(同一图的底部面板),只需要进行一次更改,即更新课程详情表中的单个条目。
There is another big advantage of the multitable approach. If the tables are designed correctly, then changes to the database can be made more easily. For example, suppose the room number for MATH314 has changed from 560 to 440. In the one-table approach (top panel of the figure on the previous page), four separate rows would need to be updated and, as we discussed earlier, these four updates would need to be wrapped in a single transaction to ensure that the database remains consistent. But in the multitable approach (bottom panel of the same figure), only one change is required, updating a single entry in the table of course details.
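To make the difference concrete, here is a small sketch of the MATH314 room change in both designs. The student names other than Luigi are placeholders, and the tables are trimmed to the columns that matter here; only the course numbers and room numbers come from the text.

```python
# One-table design: the room number is repeated on every MATH314 row.
one_table = [
    {"student": "Student 1", "course_number": "MATH314", "room": 560},
    {"student": "Student 2", "course_number": "MATH314", "room": 560},
    {"student": "Student 3", "course_number": "MATH314", "room": 560},
    {"student": "Student 4", "course_number": "MATH314", "room": 560},
    {"student": "Luigi", "course_number": "HIST256", "room": 851},
]
rows_updated = 0
for row in one_table:
    if row["course_number"] == "MATH314":
        row["room"] = 440
        rows_updated += 1
print(rows_updated)  # 4

# Two-table design: the room number lives in exactly one place.
course_rooms = {"MATH314": 560, "HIST256": 851}
course_rooms["MATH314"] = 440  # a single change
print(course_rooms["MATH314"])  # 440
```

If the one-table loop were interrupted partway through, some MATH314 rows would show room 560 and others 440, which is why those four updates need to be wrapped in a transaction.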
键
Keys
这里值得指出的是,虽然这个简单的“学生-课程”示例仅使用两个表即可高效地表示,但实际数据库通常包含许多表。我们很容易想象用新表扩展我们的“学生-课程”示例。例如,可以有一个表包含每个学生的详细信息,例如学号、电话号码和家庭住址。也可以为每位教师创建一个表,列出电子邮件地址、办公室地址和办公时间。每个表的设计都使其大多数列存储在其他任何地方都不会重复的数据——这样做的目的是,每当需要某个对象的详细信息时,我们都可以在相关表中“查找”这些详细信息。
It's worth pointing out here that, while this simple student-courses example is most efficiently represented using only two tables, real databases often incorporate many tables. It is easy to imagine extending our student-courses example with new tables. For example, there could be a table containing details for each student, such as a student ID number, phone number, and home address. There could be a table for each instructor, listing e-mail address, office location, and office hours. Each table is designed so that most of its columns store data that is not repeated anywhere else—the idea is that whenever details about a certain object are required, we can “look up” those details in the relevant table.
在数据库术语中,用于在表中“查找”详细信息的任何列都称为键。例如,让我们思考一下如何找到 Luigi 的历史课的房间号。使用上一页图上面板的单表方法,我们只需扫描行直到找到 Luigi 的历史课,查看房间号列,然后观察答案,在本例中为 851。但是,在同一张图的下图的多表方法中,我们首先扫描第一个表以找到 Luigi 历史课的课程号——结果是“HIST256”。然后,我们使用“HIST256”作为另一个表中的键:我们通过找到包含“HIST256”作为课程号的行来查找这门课程的详细信息,然后遍历该行找到房间号(同样是 851)。此过程如下一页的图所示。
In database terminology, any column that is used to “look up” details in a table is called a key. For example, let's think about how we would find out the room number for Luigi's history class. Using the single-table approach of the upper panel of the figure on the previous page, we just scan the rows until we find Luigi's history class, look across to the room number column, and observe the answer, which in this case is 851. But in the multitable approach of the same figure's lower panel, we initially scan the first table to find the course number of Luigi's history class—this turns out to be “HIST256.” Then we use “HIST256” as a key in the other table: we look up the details for this course by finding the row containing “HIST256” as its course number, then move across that row to find the room number (again, 851). This process is shown in the figure on the following page.
像这样使用键的妙处在于,数据库可以非常高效地查找键。这与人类在字典中查找单词的方式类似。想象一下,你会如何在一本纸质字典中查找“认识论”(epistemology)这个词。当然,你不会从第一页开始浏览所有条目来寻找“认识论”。相反,你会通过查看页面标题来快速缩小搜索范围,先将页面翻到大块,然后随着接近目标页面,逐渐缩小到小块。数据库使用相同的技术查找键,但它们的效率甚至比人类更高。这是因为数据库可以预先计算需要翻阅的页面“块”,并记录每个块开头和结尾的标题。一组用于快速查找键的预先计算的块在计算机科学中被称为B 树。 B 树是现代数据库的另一个重要且巧妙的想法,但不幸的是,对 B 树的详细讨论会让我们离题太远。
The beauty of using keys like this is that databases can look up keys with superb efficiency. This is done in a similar fashion to the way a human looks up a word in a dictionary. Think about how you would go about finding the word “epistemology” in a printed dictionary. Naturally, you would not start at the first page and scan through every entry looking for “epistemology.” Instead, you quickly narrow in on the word by looking at the page headings, initially turning the pages in large clumps and gradually reverting to smaller clumps as you get close to your goal. Databases look up keys using the same technique, but they are even more efficient than humans. This is because the database can precalculate the “clumps” of pages that will be turned and keep a record of the headings at the start and end of each clump. A set of precalculated clumps for fast key lookup is known in computer science as a B-tree. The B-tree is yet another crucial and ingenious idea underpinning modern databases, but a detailed discussion of B-trees would, unfortunately, lead us too far afield.
使用键查找数据:为了找到 Luigi 历史课的教室号,我们首先在左侧表中找到相关的课程号。然后,将此值“HIST256”用作另一个表的键。由于课程号列按字母顺序排列,我们可以非常快速地找到正确的行,然后获得相应的教室号 (851)。
Looking up data using a key: To find out the room number for Luigi's history course, we first find the relevant course number in the left-hand table. This value, “HIST256,” is then used as a key in the other table. Because the column of course numbers is sorted in alphabetical order, we can find the correct row very quickly, then obtain the corresponding room number (851).
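The two-step lookup in the figure can be sketched as follows. A real database would use a B-tree for the second step; this example swaps in a plain binary search over a sorted list (Python's standard `bisect` module), which shows the same "narrow in quickly" idea on a much smaller scale. The room number for ARCH101 and the students other than Luigi are invented for illustration.

```python
from bisect import bisect_left

# Left-hand table: which student takes which course (partial, illustrative).
enrollments = [("Student 1", "ARCH101"), ("Luigi", "HIST256"), ("Luigi", "MATH314")]

# Right-hand table, sorted by course number so that binary search applies.
course_numbers = ["ARCH101", "HIST256", "MATH314"]
rooms = [305, 851, 560]  # 305 is a hypothetical room for ARCH101


def room_for(course_number):
    """Find the row whose key equals course_number without scanning every row."""
    i = bisect_left(course_numbers, course_number)
    if i < len(course_numbers) and course_numbers[i] == course_number:
        return rooms[i]
    raise KeyError(course_number)


# Step 1: scan the first table for Luigi's history class.
key = next(c for student, c in enrollments if student == "Luigi" and c.startswith("HIST"))
# Step 2: use that course number as a key in the other table.
print(key, room_for(key))  # HIST256 851
```

With three courses the saving is invisible, but binary search needs only about 20 comparisons to find one key among a million sorted entries.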
虚拟表技巧
The Virtual Table Trick
我们即将领略现代多表数据库背后的精妙之处。其基本思想很简单:尽管数据库的所有信息都存储在一组固定的表中,但数据库可以在需要时生成全新的临时表。我们称之为“虚拟表”,以强调它们实际上从未存储在任何地方——数据库在需要它们响应数据库查询时才会创建它们,然后立即删除它们。
We are nearly ready to appreciate the main ingenious trick behind modern multitable databases. The basic idea is simple: although all of a database's information is stored in a fixed set of tables, a database can generate completely new, temporary tables whenever it needs to. We'll call these “virtual tables” to emphasize the fact that they are never really stored anywhere—the database creates them whenever they are needed to answer a query to the database and then immediately deletes them.
一个简单的例子将演示虚拟表技巧。假设我们从第142页图表下方面板的数据库开始,用户输入一个查询,要求查找所有参加Kirby教授课程的学生姓名。数据库实际上可以通过几种不同的方式处理此查询;我们仅讨论其中一种可能的方法。第一步是创建一个新的虚拟表,列出所有课程的学生和教师。这可以通过一种称为“两表连接”的特殊数据库操作来完成。其基本思想是将一个表的每一行与另一个表的每个对应行组合起来,其中的对应关系由两个表中都存在的键列建立。例如,当我们使用“课程号”列作为键连接第142页图表下方面板的两个表时,结果将是一个与图表上方面板完全相同的虚拟表——每个学生姓名都与第二个表中相关课程的所有详细信息组合在一起,并使用“课程号”作为键查找这些详细信息。当然,原始查询是关于学生姓名和导师的,所以我们不需要任何其他列。幸运的是,数据库包含一个投影操作,可以让我们丢弃不感兴趣的列。因此,在执行连接操作合并两个表之后,再执行投影操作以消除一些不必要的列,数据库会生成以下虚拟表:
A simple example will demonstrate the virtual table trick. Suppose we start with the database of the lower panel of the figure on page 142, and a user enters a query asking for the names of all students taking classes from Professor Kirby. There are actually several different ways a database can proceed with this query; we'll just examine one of the possible approaches. The first step is to create a new virtual table listing students and instructors for all courses. This is done using a special database operation called a join of two tables. The basic idea is to combine each row of one table with each corresponding row of the other table, where the correspondence is established by a key column that appears in both tables. For example, when we join the two tables of the bottom panel of the figure on page 142 using the “course number” column as the key, the result is a virtual table exactly like the one in the figure's top panel—each student name is combined with all of the details for the relevant course from the second table, and these details are looked up using the “course number” as a key. Of course, the original query was about student names and instructors, so we don't need any of the other columns. Luckily, databases include a projection operation that lets us throw away columns we are not interested in. So after the join operation to combine the two tables, followed by a projection operation to eliminate some unnecessary columns, the database produces the following virtual table:
接下来,数据库使用另一个重要的操作,称为select。select操作会根据某些条件从表中选择部分行,并丢弃其他行,从而生成一个新的虚拟表。在本例中,我们正在寻找选修 Kirby 教授课程的学生,因此我们需要执行一个“select”操作,只选择授课教师为“Kirby 教授”的行。这样我们就得到了这张虚拟表:
Next, the database uses another important operation called select. A select operation chooses some of the rows from a table, based on some criteria, and throws away the other rows, producing a new virtual table. In this case, we are looking for students who take courses from Professor Kirby, so we need to do a “select” operation that chooses only rows in which the instructor is “Prof Kirby.” That leaves us with this virtual table:
查询已接近完成。现在我们需要的只是另一个投影操作,去掉“instructor”列,只留下一个可以回答原始查询的虚拟表:
The query is nearly completed. All we need now is another projection operation, to throw away the “instructor” column, leaving us with a virtual table that answers the original query:
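The sequence of join, projection, select, and a final projection can be sketched with tables represented as lists of rows. Everything about the data below is an assumption made for illustration: the student names other than Luigi, the course titles, the instructors other than Kirby, the room for ARCH101, and the choice of HIST256 as Professor Kirby's course. The three functions are simplified stand-ins for the relational operations, not a database's actual implementation.

```python
def join(left, right, key):
    """Combine each row of left with each row of right that has the same key value."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def project(table, columns):
    """Keep only the named columns, dropping any duplicate rows that result."""
    result = []
    for row in table:
        slim = {c: row[c] for c in columns}
        if slim not in result:
            result.append(slim)
    return result


def select(table, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


enrollments = [
    {"student": s, "course_number": c}
    for s, c in [
        ("Ana", "ARCH101"), ("Ben", "ARCH101"), ("Chen", "ARCH101"),
        ("Luigi", "HIST256"), ("Ana", "HIST256"), ("Dev", "HIST256"),
        ("Ben", "MATH314"), ("Chen", "MATH314"), ("Luigi", "MATH314"), ("Eve", "MATH314"),
    ]
]
courses = [
    {"course_number": "ARCH101", "title": "Title 1", "instructor": "Prof X", "room": 305},
    {"course_number": "HIST256", "title": "Title 2", "instructor": "Prof Kirby", "room": 851},
    {"course_number": "MATH314", "title": "Title 3", "instructor": "Prof Y", "room": 560},
]

joined = join(enrollments, courses, "course_number")           # 10 rows, 5 columns
pairs = project(joined, ["student", "instructor"])             # students and instructors
kirby = select(pairs, lambda row: row["instructor"] == "Prof Kirby")
answer = project(kirby, ["student"])
print([row["student"] for row in answer])  # ['Luigi', 'Ana', 'Dev']
```

None of `joined`, `pairs`, or `kirby` is stored permanently; each exists only long enough to produce the next table, which is the sense in which they are "virtual".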
这里值得补充一点技术性说明。如果您恰好熟悉数据库查询语言 SQL,您可能会觉得上面“select”操作的定义相当奇怪,因为 SQL 中的“select”命令的作用远不止选择一些行。这里的术语源自数据库操作的数学理论,即关系代数,其中“select”仅用于选择行。关系代数还包括我们在查询中用来查找 Kirby 教授学生的“join”和“project”操作。
It's worth adding a slightly more technical note here. If you happen to be familiar with the database query language SQL, you might find the above definition of the “select” operation rather strange, as the “select” command in SQL does much more than merely selecting some rows. The terminology here comes from a mathematical theory of database operations, known as relational algebra, in which “select” is used only for selecting rows. Relational algebra also includes the “join” and “project” operations that we used in our query to find Professor Kirby's students.
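For readers who do know SQL, the note above can be made concrete: a single SQL `SELECT` statement covers the join, the row selection, and the projection at once. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names, and all data other than the course numbers, Luigi, Professor Kirby, and rooms 851 and 560, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollments (student TEXT, course_number TEXT)")
conn.execute(
    "CREATE TABLE courses (course_number TEXT PRIMARY KEY, title TEXT, instructor TEXT, room INTEGER)"
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [
        ("Ana", "ARCH101"), ("Ben", "ARCH101"), ("Chen", "ARCH101"),
        ("Luigi", "HIST256"), ("Ana", "HIST256"), ("Dev", "HIST256"),
        ("Ben", "MATH314"), ("Chen", "MATH314"), ("Luigi", "MATH314"), ("Eve", "MATH314"),
    ],
)
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?, ?)",
    [
        ("ARCH101", "Title 1", "Prof X", 305),
        ("HIST256", "Title 2", "Prof Kirby", 851),
        ("MATH314", "Title 3", "Prof Y", 560),
    ],
)

rows = conn.execute(
    "SELECT DISTINCT e.student "                             # projection: keep one column
    "FROM enrollments AS e "
    "JOIN courses AS c ON e.course_number = c.course_number "  # join on the key column
    "WHERE c.instructor = ? "                                # relational-algebra 'select'
    "ORDER BY e.student",
    ("Prof Kirby",),
).fetchall()
names = [row[0] for row in rows]
print(names)  # ['Ana', 'Dev', 'Luigi']
conn.close()
```

Here the `WHERE` clause plays the role that relational algebra calls "select", while the column list after the SQL keyword `SELECT` corresponds to "project".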
关系数据库
Relational Databases
将所有数据存储在互连表中的数据库(例如我们一直在使用的表)称为关系数据库。IBM 研究员 EF Codd 在其极具影响力的 1970 年论文“大型共享数据库的关系数据模型”中倡导了关系数据库。与科学界许多最伟大的理念一样,关系数据库现在回想起来似乎很简单,但在当时,它们代表了信息高效存储和处理的巨大飞跃。事实证明,仅仅少数几个操作(例如我们之前看到的关系代数运算“select”、“join”和“project”)就足以生成虚拟表,这些虚拟表基本上可以回答对关系数据库的任何查询。因此,关系数据库可以将其数据存储在以提高效率为目的的结构表中,并使用虚拟表技巧来回答看似需要采用不同格式的数据的查询。
A database that stores all of its data in interconnected tables such as the ones we have been using is called a relational database. Relational databases were advocated by the IBM researcher E. F. Codd in his extraordinarily influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks.” Like many of the greatest ideas in science, relational databases seem simple in retrospect—but at the time, they represented a huge leap forward in the efficient storage and processing of information. It turns out that a mere handful of operations (such as the relational algebra operations “select,” “join,” and “project” we saw earlier) are sufficient to generate virtual tables that answer essentially any query to a relational database. So a relational database can store its data in tables that are structured for efficiency, and use the virtual table trick to answer queries that seemingly require the data to be in a different form.
这就是为什么关系数据库被用于支持大量电子商务活动的原因。每当您在线购物时,您很可能都在与大量存储产品、客户和个人购买信息的关系数据库表进行交互。在网络空间中,我们无时无刻不在被关系数据库包围,甚至常常没有意识到这一点。
That's why relational databases are used to support a large proportion of e-commerce activities. Whenever you buy something online, you are probably interacting with a slew of relational database tables storing information about products, customers, and individual purchases. In cyberspace, we are constantly surrounded by relational databases, often without even realizing it.
数据库的人性化一面
THE HUMAN SIDE OF DATABASES
对于普通读者来说,数据库可能是本书中最无趣的话题。数据存储本身就很难让人兴奋。但在幕后,数据库运作的精妙理念却讲述了一个截然不同的故事。数据库构建于可能在任何操作过程中发生故障的硬件之上,却依然为我们提供了我们期望从网上银行和类似活动中获得的效率和坚如磐石的可靠性。待办事项列表技巧为我们提供了原子事务,即使数千名客户同时与数据库交互,也能确保一致性。这种高并发性,加上通过虚拟表技巧实现的快速查询响应,使大型数据库高效运行。待办事项列表技巧还能在发生故障时保证一致性。当与复制数据库的“准备后提交”技巧相结合时,我们就能获得坚不可摧的数据一致性和持久性。
To the casual observer, databases may well be the least exciting topic in this book. It's just hard to get excited about data storage. But under the covers, the ingenious ideas that make databases work tell a different story. Built out of hardware that can fail in the middle of any operation, databases nevertheless give us the efficiency and rock-solid dependability that we have come to expect from online banking and similar activities. The to-do list trick gives us atomic transactions, which enforce consistency even when thousands of customers are simultaneously interacting with a database. This immense level of concurrency, together with rapid query responses via the virtual table trick, makes large databases efficient. The to-do list trick also guarantees consistency in the face of failures. When combined with the prepare-then-commit trick for replicated databases, we are left with iron-clad consistency and durability for our data.
数据库战胜不可靠组件的伟大胜利,计算机科学家称之为“容错”,这是许多研究人员数十年的心血结晶。但其中最重要的贡献者之一是 Jim Gray,他是一位出色的计算机科学家,撰写了有关事务处理的书籍。(该书即《事务处理:概念与技术》,首次出版于 1992 年。)遗憾的是,Gray 的职业生涯结束得早:2007 年的一天,他驾驶游艇驶出旧金山湾,穿过金门大桥,驶入公海,计划前往附近的一些岛屿进行一日游。从此,Gray 或他的船再也没有踪迹。在这个悲剧故事中,令人感动的是,Gray 在数据库社区的许多朋友用他自己的工具拯救了他:新生成的旧金山附近海洋卫星图像被上传到数据库,以便朋友和同事可以搜索这位失踪的数据库先驱的任何踪迹。不幸的是,这次搜寻并没有成功,计算机科学界失去了一位杰出的人物。
The heroic triumph of databases over unreliable components, known by computer scientists as “fault-tolerance,” is the work of many researchers over many decades. But among the most important contributors was Jim Gray, a superb computer scientist who literally wrote the book on transaction processing. (The book is Transaction Processing: Concepts and Techniques, first published in 1992.) Sadly, Gray's career ended early: one day in 2007, he sailed his yacht out of San Francisco Bay, under the Golden Gate Bridge, and into the open ocean on a planned day trip to some nearby islands. No sign of Gray, or his boat, was ever seen again. In a heart-warming twist to this tragic story, Gray's many friends in the database community used his own tools in an effort to save him: freshly generated satellite imagery of the ocean near San Francisco was uploaded to a database so that friends and colleagues could search for any trace of the missing database pioneer. Unfortunately, the search was not successful, and the world of computer science was left without one of its leading luminaries.
9
9
数字签名:谁真正编写了这个软件?
Digital Signatures: Who Really Wrote This Software?
—查尔斯·狄更斯,《双城记》
—CHARLES DICKENS, A Tale of Two Cities
在本书中我们将遇到的所有概念中,“数字签名”的概念或许是最矛盾的。“数字”一词,从字面上理解,是指“由一串数字组成”。因此,根据定义,任何数字化的东西都可以被复制:要复制,只需一次复制一位数字即可。如果你能读懂,你就能复制!另一方面,“签名”的意义在于它可以被读取,但除了作者之外,任何人都无法复制(即伪造)。那么,如何才能创建一个既是数字又无法复制的签名呢?在本章中,我们将探索这一有趣悖论的解决方案。
Of all the ideas we'll encounter in this book, the concept of a “digital signature” is perhaps the most paradoxical. The word “digital,” interpreted literally, means “consisting of a string of digits.” So, by definition, anything that is digital can be copied: to do so, just copy the digits one at a time. If you can read it, you can copy it! On the other hand, the whole point of a “signature” is that it can be read, but can't be copied (that is, forged) by anyone other than its author. How could it be possible to create a signature that is digital, yet can't be copied? In this chapter, we will discover the resolution of this intriguing paradox.
数字签名的真正用途是什么?
WHAT ARE DIGITAL SIGNATURES REALLY USED FOR?
似乎没有必要问这个问题:数字签名有什么用?当然,你可能会想,我们可以用它们做和纸质签名一样的事情:签署支票和其他法律文件,比如公寓租约。但如果你仔细想一想,就会意识到事实并非如此。无论你使用信用卡还是网上银行系统进行在线支付,你都需要提供任何类型的签名吗?答案是否定的。通常,在线信用卡支付不需要任何签名。网上银行系统略有不同,因为它们要求你使用密码登录以验证你的身份。但如果你稍后在网上银行会话期间付款,则不需要任何类型的签名。
It might seem unnecessary to ask the question: what are digital signatures used for? Surely, you might think, we can use them for the same kinds of things that paper signatures are used for: signing checks and other legal documents, such as the lease on an apartment. But if you think about it for a moment, you will realize that this isn't true. Whenever you make an online payment for something, whether by credit card or through an online banking system, do you provide any kind of signature? The answer is no. Typically, online credit card payments require no signature whatsoever. Online banking systems are a little different, because they require you to log in with a password that helps to verify your identity. But if you later make a payment during your online banking session, no signature of any kind is required.
您的计算机会自动检查数字签名。上图:当我尝试下载并运行具有有效数字签名的程序时,我的网络浏览器显示的消息。下图:数字签名无效或缺失导致的结果。
Your computer checks digital signatures automatically. Top: The message my web browser displays when I attempt to download and run a program that has a valid digital signature. Bottom: The result of an invalid or missing digital signature.
那么,数字签名在实际中到底有什么用途呢?答案可能与您最初的想法相反:通常情况下,不是您自己签署发送给他人的材料,而是其他人在将材料发送给您之前先进行签名。您可能没有意识到这一点,因为您的计算机会自动验证数字签名。例如,每当您尝试下载并运行某个程序时,您的网络浏览器可能会检查该程序是否具有数字签名以及该签名是否有效。然后,它会显示相应的警告,例如上述警告。
What, then, are digital signatures used for in practice? The answer is the reverse of what you might first think: instead of you signing material that is sent to others, it is typically others who sign material before sending it to you. The reason you are probably not aware of this is that the digital signatures are verified automatically by your computer. For example, whenever you attempt to download and run a program, your web browser probably checks to see if the program has a digital signature and whether or not the signature is valid. Then it can display an appropriate warning, like the ones above.
如您所见,存在两种可能性。如果软件具有有效签名(如图上图所示),计算机可以完全放心地告诉您编写该软件的公司名称。当然,这并不能保证软件是安全的,但至少您可以根据对该公司的信任程度做出明智的决定。另一方面,如果签名无效或缺失(如图下图所示),您就完全无法确定软件的真正来源。即使您认为自己是从信誉良好的公司下载的软件,也有可能黑客以某种方式用恶意软件替换了正版软件。或者,该软件可能是由业余爱好者开发的,他们没有时间或动力创建有效的数字签名。在这种情况下,是否信任该软件取决于您(用户)。
As you can see, there are two possibilities. If the software has a valid signature (as in the top panel of the figure), the computer can tell you with complete confidence the name of the company that wrote the software. Of course, this doesn't guarantee that the software is safe, but at least you can make an informed decision based on the amount of trust you have in the company. On the other hand, if the signature is invalid or missing (as in the bottom panel of the figure), you have absolutely no reassurance about where the software really came from. Even if you thought you were downloading software from a reputable company, it's possible that a hacker somehow substituted some malicious software for the real thing. Alternatively, maybe the software was produced by an amateur who did not have the time or motivation to create a valid digital signature. It is up to you, the user, to decide whether you trust the software under these circumstances.
虽然软件签名是数字签名最明显的应用,但它绝不是唯一的应用。事实上,您的计算机接收和验证数字签名的频率令人惊讶,因为一些常用的互联网协议会使用数字签名来验证与您交互的计算机的身份。例如,网址以“https”开头的安全服务器通常会在建立安全会话之前向您的计算机发送数字签名证书。数字签名也用于验证许多软件组件(例如浏览器插件)的真实性。您可能在浏览网页时看到过关于此类内容的警告信息。
Although software-signing is the most obvious application of digital signatures, it is by no means the only one. In fact, your computer receives and verifies digital signatures surprisingly often, because some frequently used internet protocols employ digital signatures to verify the identity of the computers you are interacting with. For example, secure servers whose web addresses begin with “https” typically send your computer a digitally signed certificate before establishing a secure session. Digital signatures are also used to verify the authenticity of many software components, such as browser plugins. You have probably seen warning messages about such things while surfing the web.
您可能遇到过另一种在线签名:有些网站要求您在在线表格中输入姓名作为签名。例如,我有时在为我的一个学生填写在线推荐信时就必须这样做。这可不是计算机科学家所说的数字签名!显然,任何知道您姓名的人都可以毫不费力地伪造这种输入的签名。在本章中,我们将学习如何创建无法伪造的数字签名。
There is another type of online signature you may have encountered: some websites ask you to type your name as a signature in an online form. I sometimes have to do this when filling out an online recommendation letter for one of my students, for instance. This is not what a computer scientist means by a digital signature! Obviously, this kind of typed signature can be forged effortlessly, by anyone who knows your name. In this chapter, we will learn how to create a digital signature that cannot be forged.
纸质签名
PAPER SIGNATURES
我们对数字签名的解释将循序渐进,从我们熟悉的纸质签名开始,逐步过渡到真正的数字签名。首先,让我们回到一个完全没有计算机的世界。在这个世界上,验证文件的唯一方法是纸质手写签名。请注意,在这种情况下,签名文件无法单独进行验证。例如,假设你找到一张纸,上面写着“我承诺支付给弗朗索瓦丝 100 美元。签名,拉维”——正如上图所示。你如何验证拉维确实签署了这份文件?答案是,你需要一个值得信赖的签名存储库,你可以去那里检查拉维的签名是否真实。在现实世界中,银行和政府部门等机构承担着这一角色——它们确实保存着存储客户签名的文件,并在必要时对这些文件进行物理检查。在我们的模拟场景中,我们假设一个名为“纸质签名银行”的值得信赖的机构将每个人的签名都保存在文件中。上图为纸质签名库的示意图。
Our explanation of digital signatures is going to be built up gradually, starting with the familiar situation of paper signatures and moving in small steps toward genuine digital signatures. So to start with, let's go back to a world with no computers at all. In this world, the only way to authenticate documents is with handwritten signatures on paper. Notice that in this scenario, a signed document can't be authenticated in isolation. For example, suppose you find a piece of paper that says “I promise to pay $100 to Francoise. Signed, Ravi”—just as shown above. How can you verify that Ravi really signed this document? The answer is that you need some trusted repository of signatures, where you can go and check that Ravi's signature is genuine. In the real world, institutions such as banks and government departments perform this role—they really do keep files storing the signatures of their customers, and these files can be physically checked if necessary. In our pretend scenario, let's imagine that a trusted institution called a “paper signature bank” keeps everyone's signature on file. A schematic example of a paper signature bank is shown above.
带有手写签名的纸质文件。
A paper document with a handwritten signature.
将客户身份与手写签名一起保存在文件中的银行。
A bank that stores the identities of its customers together with handwritten signatures on file.
为了验证承诺支付给 Francoise 的文件上 Ravi 的签名,我们只需要去纸质签名银行并要求查看 Ravi 的签名。显然,我们在这里做了两个重要的假设。首先,我们假设银行是值得信赖的。理论上,银行员工可以将 Ravi 的签名替换为冒名顶替者的签名,但我们在这里忽略这种可能性。其次,我们假设冒名顶替者不可能伪造 Ravi 的签名。众所周知,这个假设是完全错误的:熟练的伪造者可以轻易复制签名,甚至业余爱好者也可以做出合理的近似。尽管如此,我们需要不可伪造的假设——没有它,纸质签名就毫无用处。稍后,我们将看到数字签名基本上是不可能伪造的。这是数字签名相对于纸质签名的一大优势。
To verify Ravi's signature on the document promising to pay Francoise, we just need to go to the paper signature bank and ask to see Ravi's signature. Obviously, we are making two important assumptions here. First, we assume the bank can be trusted. In theory, it would be possible for the bank employees to switch Ravi's signature for an imposter's, but we are going to ignore this possibility here. Second, we assume it is impossible for an imposter to forge Ravi's signature. This assumption, as everyone knows, is plain wrong: a skilled forger can easily reproduce a signature, and even amateurs can do a reasonable approximation. Nevertheless, we need the assumption of unforgeability—without it, the paper signature is useless. Later on, we will see how digital signatures are essentially impossible to forge. This is one of the big advantages of digital signatures over paper ones.
用挂锁签名
SIGNING WITH A PADLOCK
我们迈向数字签名的第一步是彻底摒弃纸质签名,采用一种新的文件验证方法,该方法依赖于挂锁、钥匙和上锁的盒子。新方案中的每个参与者(在我们的示例中,指的是 Ravi、Takeshi 和 Francoise)都会获得大量的挂锁。至关重要的是,每个参与者的挂锁都必须完全相同——因此 Ravi 的挂锁也完全相同。此外,每个参与者的挂锁必须是独占的:没有其他人可以制造或获得与 Ravi 相同的挂锁。最后,本章中的所有挂锁都具有一个相当独特的特性:它们都配备了生物识别传感器,确保只有其所有者才能上锁。因此,如果 Francoise 发现 Ravi 的挂锁被打开了,她就无法用它锁任何东西。当然,Ravi 也拥有一些可以打开他挂锁的钥匙。由于他所有的挂锁都相同,所以所有的钥匙也都相同。下一页以示意图的形式展示了目前的情况。这就是我们所说的“物理挂锁技巧”的初始设置。
Our first step toward digital signatures is to abandon paper signatures altogether and adopt a new method of authenticating documents that relies on padlocks, keys, and locked boxes. Every participant in the new scheme (in our running example, that means Ravi, Takeshi, and Francoise) acquires a large supply of padlocks. It is crucial that the padlocks belonging to each individual participant are identical—so Ravi's padlocks are all the same. Additionally, each participant's padlocks must be exclusive: no one else can make or obtain a padlock like Ravi's. And finally, all padlocks in this chapter have a rather unusual feature: they are equipped with biometric sensors which ensure they can only be locked by their owner. So if Francoise finds an open padlock belonging to Ravi, she can't use it to lock anything. Of course, Ravi also has a supply of keys that will open his padlocks. Because all of his padlocks are identical, all the keys are identical too. The situation so far is shown schematically on the following page. This is the initial setup for what we might call the “physical padlock trick.”
现在假设,和之前一样,Ravi 欠 Francoise 100 美元,Francoise 希望以可验证的方式记录这一事实。换句话说,Francoise 想要获得与上一页文件等价的金额,但不需要手写签名。这个技巧的运作方式如下:Ravi 制作了一份文件,上面写着“Ravi 承诺向 Francoise 支付 100 美元”,但他懒得签名。他复印了一份文件,并将其放入一个带锁的盒子中。(带锁的盒子就是一个坚固的盒子,可以用挂锁锁住。)最后,Ravi 用一把挂锁锁上盒子,并将上锁的盒子交给 Francoise。完整的流程如图所示。上锁的盒子就是文件的签名,这一点我们很快就会理解。请注意,最好让 Francoise 或其他值得信赖的见证人在签名过程中进行观察。否则,拉维可以通过将另一份文件放入保险箱来作弊。(可以说,如果保险箱是透明的,这个方案会更有效。毕竟,数字签名提供的是真实性,而不是保密性。然而,透明的保险箱有点违反直觉,所以我们不会探讨这种可能性。)
Now let's suppose that just as before, Ravi owes Francoise $100, and Francoise would like to record that fact in a verifiable way. In other words, Francoise wants the equivalent of the document on the previous page, but without relying on a handwritten signature. Here is how the trick is going to work. Ravi makes a document saying “Ravi promises to pay $100 to Francoise,” and doesn't bother to sign it. He makes a copy of the document and places the copy in a lockbox. (A lockbox is just a strongly made box that can be locked with a padlock.) Finally, Ravi locks the box with one of his padlocks and gives the locked box to Francoise. The complete package is shown in the figure on the facing page. In a sense that will be made precise very soon, the locked box is the signature for the document. Note that it would be a good idea for Francoise, or some other trusted witness, to watch while the signature is created. Otherwise, Ravi could cheat by putting a different document into the box. (Arguably, this scheme would work even better if the lockboxes were transparent. After all, digital signatures provide authenticity, not secrecy. However, transparent lockboxes are a little counterintuitive, so we won't pursue this possibility.)
在实体挂锁魔术中,每个参与者都有一组专属的、完全相同的挂锁和钥匙。
In the physical padlock trick, each participant has an exclusive supply of identical padlocks and keys.
或许你已经明白弗朗索瓦丝现在如何验证拉维文件的真实性了。如果有人,甚至可能是拉维本人,试图否认文件的真实性,弗朗索瓦丝可以说:“好的,拉维,请借我一把钥匙。现在我要用你的钥匙打开这个保险箱。” 在拉维和其他一些证人(甚至可能是法庭上的法官)的见证下,弗朗索瓦丝打开挂锁,展示保险箱里的物品。然后,弗朗索瓦丝可以继续说道:“拉维,由于你是唯一可以使用这把钥匙打开挂锁的人,所以其他人不可能对保险箱里的物品负责。因此,只有你写了这张纸条,并把它放进了保险箱。你欠我100美元!”
Perhaps you can already see how Francoise can now authenticate Ravi's document. If anyone, perhaps even Ravi himself, tries to deny the authenticity of the document, Francoise can say “Okay Ravi, please lend me one of your keys for a minute. Now I'm going to open this lockbox using your key.” In the presence of Ravi and some other witnesses (maybe even a judge in a court of law), Francoise opens the padlock and displays the contents of the lockbox. Then Francoise can continue: “Ravi, as you are the only person with access to padlocks that work with this key, no one else can possibly be responsible for the contents of the lockbox. Therefore, you and only you wrote this note and put it in the lockbox. You do owe me $100!”
虽然乍一听可能有点复杂,但这种身份验证方法既实用又有效。不过,它也有一些缺点。主要问题是它需要拉维的配合:在弗朗索瓦丝能够证明任何事情之前,她必须说服拉维借给她一把钥匙。但拉维可能会拒绝,甚至更糟的是,他可能会假装配合,但实际上给她另一把钥匙——一把打不开他挂锁的钥匙。这样,当弗朗索瓦丝打不开锁箱时,拉维就可以说:“看,那不是我的挂锁,所以伪造者可能在我不知情的情况下伪造了这份文件并把它放了进去。”
Although it sounds convoluted when you first encounter it, this method of authentication is both practical and powerful. It does have some drawbacks, however. The main problem is that it requires Ravi's cooperation: before Francoise can prove anything, she has to persuade Ravi to lend her one of his keys. But Ravi could refuse, or even worse, he could pretend to cooperate but in fact give her a different key—a key that will not open his padlock. Then, when Francoise fails to open the lockbox, Ravi can say, “See, that's not one of my padlocks, so a forger could have created the document and put it in there without my knowledge.”
为了使用物理挂锁技巧制作可验证的签名,拉维将文件的副本放在锁箱中,并用一把挂锁将其锁上。
To make a verifiable signature using the physical padlock trick, Ravi places a copy of the document in a lockbox and locks it with one of his padlocks.
为了防止拉维的这种狡猾手段,我们仍然需要借助银行等可信第三方。与第152页的纸质签名银行不同,我们的新银行会存储密钥。因此,参与者现在不再向银行提供签名副本,而是向银行提供一把可以打开挂锁的实体钥匙。下一页的图显示了一个实体钥匙银行。
To prevent this cunning approach by Ravi, we still need to resort to a trusted third party such as a bank. In contrast to the paper signature bank on page 152, our new bank will store keys. So instead of giving the bank a copy of their signatures, participants now give the bank a physical key that will open their padlocks. A physical key bank is shown in the figure on the following page.
这家银行是谜题的最后一块拼图,也为“实体挂锁”的诡计画画上了圆满的句号。如果弗朗索瓦丝需要证明拉维开具了欠条,她只需带着保险箱和几名证人去银行,用拉维的钥匙打开即可。挂锁能打开的事实,证明只有拉维才能对保险箱里的东西负责,而保险箱里装着弗朗索瓦丝试图验证的文件的副本。
This bank is the final piece in the puzzle, completing the explanation of the physical padlock trick. If Francoise ever needs to prove that Ravi wrote the IOU, she just takes the lockbox to the bank with some witnesses and opens it there with Ravi's key. The fact that the padlock opens proves that only Ravi could be responsible for the contents of the box, and the box contains an exact copy of the document that Francoise is trying to authenticate.
使用乘法锁签名
SIGNING WITH A MULTIPLICATIVE PADLOCK
我们构建的钥匙和挂锁基础设施正是数字签名所需的方法。然而,显然,我们不能使用物理挂锁和物理钥匙来进行必须以电子方式传输的签名。因此,下一步是用可以数字表示的类似数学对象来取代挂锁和钥匙。具体来说,挂锁和钥匙将用数字表示,而锁定或解锁的操作将用时钟算法中的乘法来表示。如果您对时钟算法不太熟悉,现在是重读第4章第52页中相关解释的好时机。
The key-and-padlock infrastructure that we've built up turns out to be exactly the approach required for digital signatures. Obviously, however, we can't use physical padlocks and physical keys for signatures that must be transmitted electronically. So the next step is to replace the padlocks and keys with analogous mathematical objects that can be represented digitally. Specifically, the padlocks and keys will be represented by numbers, and the act of locking or unlocking will be represented by multiplication in clock arithmetic. If you're not too familiar with clock arithmetic, now would be a great time to reread the explanation given in chapter 4, on page 52.
实体钥匙库存储着可打开每位参与者挂锁的钥匙。请注意,每把钥匙都不同。
A physical key bank stores keys that will open each participant's padlocks. Note that each of the keys is different.
为了创建不可伪造的数字签名,计算机需要使用非常大的时钟长度——通常长度可达数十或数百位。然而,在本描述中,我们将使用一个不切实际的小时钟长度,以确保计算简单。
To create unforgeable digital signatures, computers use absolutely enormous clock sizes—typically tens or hundreds of digits in length. However, in this description, we will be using an unrealistically small clock size to ensure that the calculations are simple.
具体来说,本节中的所有示例都将使用 11 的时钟大小。因为我们将经常使用此时钟大小将数字相乘,所以我在下一页提供了一个表格,列出了小于 11 的数字相乘的所有值。例如,让我们计算 7 × 5。要手动完成此操作,不使用表格,我们首先使用常规算术计算答案:7 × 5 = 35。然后,我们取除以 11 后的余数。现在,11 除以 35 三次(得到 33),剩下 2。所以最终答案应该是 2。查看表格,我们看到第 7 行第 5 列的条目确实是 2。(您也可以使用第 7 列第 5 行 - 顺序无关紧要,因为您可以自己检查。)自己尝试另外几个乘法示例以确保您理解。
Specifically, all the examples in this section will use a clock size of 11. Because we will be multiplying numbers together using this clock size quite a bit, I've provided a table on the next page, listing all the values for multiplying together numbers less than 11. As an example, let's compute 7 × 5. To do it manually, without using the table, we would first compute the answer using regular arithmetic: 7 × 5 = 35. Then, we take the remainder after dividing by 11. Now, 11 goes into 35 three times (giving 33) with 2 left over. So the final answer should be 2. Looking at the table, we see the entry in row 7 and column 5 is indeed 2. (You can also use column 7 and row 5—the order doesn't matter, as you can check for yourself.) Try out another couple of multiplication examples for yourself to make sure you understand.
在继续之前,我们需要对试图解决的问题稍作修改。之前,我们一直在寻找让Ravi“签署”一条消息(实际上是一张欠条)给Francoise的方法。这条消息是用纯英语写的。但从现在开始,只使用数字会方便得多。因此,我们必须承认,计算机很容易将消息翻译成一串数字供Ravi签名。以后,如果有人需要验证Ravi对这串数字的数字签名,只需逆转翻译过程,将数字转换回英语即可。我们在讨论校验和(第68页)和短符号技巧(第109页)时也遇到了同样的问题。如果你想更详细地理解这个问题,请回顾一下关于短符号技巧的讨论——第111页的图表提供了一种在字母和数字之间简单、明确地转换的可能性。
Before going on, we need a slight change to the problem that we're trying to solve. Previously, we've been looking for ways for Ravi to “sign” a message (actually, an IOU) to Francoise. The message was written in plain English. But from now on, it will be much more convenient to work with numbers only. Therefore, we have to agree that it would be easy for a computer to translate the message into a sequence of numbers for Ravi to sign. Later, if and when someone needs to authenticate Ravi's digital signature of this sequence of numbers, it will be a simple matter to reverse the translation and convert the numbers back into English. We encountered this same problem when talking about checksums (page 68) and the shorter-symbol trick (page 109). If you would like to understand this issue in more detail, look back over the discussion of the shorter-symbol trick—the figure on page 111 gives one simple, explicit possibility for translating between letters and numbers.
时钟尺寸为 11 的乘法表。
The multiplication table for clock size 11.
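The table this caption refers to can be regenerated with a few lines of code. The following is a minimal Python sketch of multiplication in clock arithmetic; the function names `clock_multiply` and `multiplication_table` are illustrative choices, not names from the book.

```python
CLOCK_SIZE = 11  # the deliberately tiny clock size used in this section


def clock_multiply(a, b, clock_size=CLOCK_SIZE):
    """Multiply with regular arithmetic, then keep the remainder."""
    return (a * b) % clock_size


def multiplication_table(clock_size=CLOCK_SIZE):
    """Return rows 1..clock_size-1, each holding columns 1..clock_size-1."""
    values = range(1, clock_size)
    return [[clock_multiply(row, col, clock_size) for col in values]
            for row in values]


if __name__ == "__main__":
    # Print a header of column numbers, then one line per row.
    print("    " + " ".join(f"{col:2d}" for col in range(1, CLOCK_SIZE)))
    for row_number, row in enumerate(multiplication_table(), start=1):
        print(f"{row_number:2d}: " + " ".join(f"{value:2d}" for value in row))
```

With clock size 11, `clock_multiply(7, 5)` gives 2, matching the worked example (35 leaves a remainder of 2 after dividing by 11), and swapping the arguments gives the same result.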
因此,Ravi 不必对用英文书写的消息进行签名,而是要对一串数字进行签名,例如“494138167543…83271696129149”。不过,为了简单起见,我们首先假设要签名的消息非常短:实际上,Ravi 的消息将由一位数字组成,例如“8”或“5”。别担心:我们最终会学习如何对更合理长度的消息进行签名。不过,目前最好还是坚持使用一位数字的消息。
So, instead of signing a message written in English, Ravi has to sign a sequence of numbers, perhaps something like “494138167543…83271696129149.” However, to keep things simple, we will start off by assuming the message to be signed is ridiculously short: in fact, Ravi's message will consist of a single digit, like “8” or “5.” Don't worry: we will eventually learn how to sign messages of a more sensible length. For now, however, it's better to stick with single-digit messages.
搞定这些准备工作后,我们就可以开始理解一个新魔术的核心了,这个魔术叫做“乘法挂锁魔术”。和实体挂锁魔术一样,拉维需要一把挂锁和一把能打开挂锁的钥匙。获得挂锁的方法出奇地简单:拉维首先选择一个钟表尺寸,然后选择任意一个小于该钟表尺寸的数字作为他的数字“挂锁”。(实际上,有些数字比其他数字效果更好,但这些细节可能会让我们误入歧途。)为了更具体一些,假设拉维选择 11 作为钟表尺寸,6 作为挂锁。
With these preliminaries out of the way, we are ready to understand the heart of a new trick, called the “multiplicative padlock trick.” As with the physical padlock trick, Ravi is going to need a padlock and a key that unlocks the padlock. Obtaining a padlock is surprisingly easy: Ravi first selects a clock size and then chooses essentially any number less than the clock size as his numerical “padlock.” (Actually, some numbers work better than others, but these details would lead us too far astray.) To make things concrete, let's say Ravi chooses 11 as his clock size and 6 as his padlock.
如何使用“挂锁”将数字消息“锁定”,从而创建数字签名。上行展示了如何使用物理挂锁将消息物理地锁定在盒子中。下行展示了类似的数学运算,其中消息是一个数字 (5),挂锁是另一个数字 (6),锁定过程相当于与给定时钟大小的乘法。最终结果 (8) 即为该消息的数字签名。
How to “lock” a numeric message using a “padlock,” creating a digital signature. The top row shows how to physically lock a message in a box using a physical padlock. The bottom row shows the analogous mathematical operation, in which the message is a number (5), the padlock is another number (6), and the process of locking corresponds to multiplication with a given clock size. The final result (8) is the digital signature for the message.
那么,Ravi 怎么才能用这把挂锁将他的消息“锁”进保险箱呢?听起来可能有点奇怪,Ravi 会用乘法来实现这一点:他的消息“锁定”后的结果等于挂锁乘以消息本身(当然,时钟尺寸是 11)。记住,我们现在处理的是单个数字消息的简单情况。假设 Ravi 的消息是“5”。那么他的“锁定”消息就是 6 × 5,也就是 8——照例,时钟尺寸也是 11。(使用上一页的乘法表再次核对。)上图总结了这个过程。最终结果“8”就是 Ravi 对原始消息的数字签名。
Now, how can Ravi “lock” his message into a lockbox with this padlock? As strange as it might sound, Ravi is going to use multiplication to do this: the “locked” version of his message will be the padlock multiplied by the message (using clock size 11, of course). Remember, we are dealing with the simple case of a single-digit message right now. So suppose Ravi's message is “5.” Then his “locked” message will be 6 × 5, which is 8—with clock size 11, as usual. (Double-check this using the multiplication table on the previous page.) This process is summarized in the figure above. The final result, “8,” is Ravi's digital signature for the original message.
当然,如果我们以后不能用某种数学“钥匙”解锁信息,这种数学“挂锁”就毫无意义了。幸运的是,我们发现了一种简单的解锁信息的方法。诀窍是再次使用乘法(像往常一样应用时钟大小),但这次我们将乘以一个不同的数字——一个特意选择的数字,以便解锁之前选择的挂锁号码。
Of course, this type of mathematical “padlocking” would be pointless if we couldn't later unlock the message using a mathematical “key” of some sort. Fortunately, it turns out there is an easy way to unlock messages. The trick is to use multiplication again (applying the clock size, as usual), but this time we'll multiply by a different number—a number selected especially so that it unlocks the previously chosen padlock number.
我们暂时使用相同的具体示例,Ravi 使用的时钟大小为 11,挂锁编号为 6。结果对应的钥匙是 2。我们怎么知道的?我们稍后会回到这个重要的问题。目前,让我们继续完成更简单的任务,即在别人告诉我们钥匙的数值后验证钥匙是否有效。如前所述,我们通过乘以钥匙来解锁挂锁消息。在上一页的图中,我们已经看到,当 Ravi 用挂锁 6 锁定消息 5 时,他得到了锁定的消息(或数字签名)8。要解锁,我们将这个 8 乘以钥匙 2,应用时钟大小后得出 5。就像魔术一样,我们最终又得到了原始消息 5!整个过程如上图所示,您还可以在其中看到其他几个示例:消息“3”在挂锁时变为“7”,使用钥匙后又变回“3”。类似地,“2”在锁定时变为“1”,但钥匙将其转换回“2”。
Let's stick with the same concrete example for the moment, so Ravi is using a clock size of 11, with 6 as his padlock number. It turns out that the corresponding key is 2. How do we know that? We will come back to this important question later. For the moment, let's stick with the easier task of verifying that the key works once someone else has told us its numeric value. As mentioned earlier, we unlock a padlocked message by multiplying by the key. We have already seen, in the figure on the previous page, that when Ravi locks the message 5 with padlock 6, he gets the locked message (or digital signature) 8. To unlock, we take this 8 and multiply by the key 2, which gives 5 after applying the clock size. Like magic, we have ended up back with the original message, 5! The whole process is shown in the figure above, where you can also see a couple of other examples: the message “3” becomes “7” when padlocked, and reverts to “3” when the key is applied. Similarly, “2” becomes “1” when locked, but the key converts it back to “2.”
如何使用数字挂锁和相应的数字键“锁定”和随后“解锁”消息。第一行展示了锁定和解锁的物理版本。接下来的三行展示了使用乘法对消息进行数字锁定和解锁的示例。请注意,锁定过程会生成数字签名,而解锁过程会生成消息。如果解锁的消息与原始消息匹配,则数字签名得到验证,原始消息真实可靠。
How to “lock” and subsequently “unlock” a message using a numeric padlock and a corresponding numeric key. The top row shows the physical version of locking and unlocking. The next three rows show examples of numerically locking and unlocking messages using multiplication. Note that the locking process produces a digital signature, whereas the unlocking process produces a message. If the unlocked message matches the original message, the digital signature is verified and the original message is authentic.
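The locking and unlocking steps in this figure can be reproduced directly. The sketch below uses the chapter's numbers (clock size 11, padlock 6, key 2); the function names `lock` and `unlock` are illustrative assumptions.

```python
CLOCK_SIZE = 11
PADLOCK = 6  # Ravi's numeric padlock
KEY = 2      # the numeric key that matches padlock 6


def lock(message, padlock=PADLOCK, clock_size=CLOCK_SIZE):
    """Produce the signature: padlock times message, in clock arithmetic."""
    return (padlock * message) % clock_size


def unlock(signature, key=KEY, clock_size=CLOCK_SIZE):
    """Recover the message: key times signature, in clock arithmetic."""
    return (key * signature) % clock_size


if __name__ == "__main__":
    for message in (5, 3, 2):
        signature = lock(message)
        print(f"message {message} -> signature {signature} "
              f"-> unlocked {unlock(signature)}")
```

Running it reproduces the three rows of the figure: 5 locks to 8, 3 locks to 7, and 2 locks to 1, and each signature unlocks to the original message. One way to see why this works: 6 × 2 = 12, which is 1 with clock size 11, so locking and then unlocking amounts to multiplying the message by 1.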
该图还解释了如何验证数字签名。您只需获取签名并使用签名者的乘法密钥将其解锁即可。如果解锁后的消息与原始消息匹配,则签名真实可靠。否则,则该签名必定是伪造的。下一页的表格更详细地展示了此验证过程。在此表中,我们坚持使用 11 的时钟大小,但为了表明我们之前使用的数字挂锁和钥匙并没有什么特殊之处,我们使用了不同的值。具体来说,挂锁值为 9,对应的钥匙值为 5。在表格的第一个示例中,消息为“4”,签名为“3”。签名解锁后为“4”,与原始消息匹配,因此签名是真实的。表格的下一行给出了类似的例子,消息为“8”,签名为“6”。但最后一行展示了如果签名是伪造的会发生什么。在这里,消息仍然是“8”,但签名为“7”。此签名解锁后为“2”,与原始消息不匹配。因此,该签名是伪造的。
This figure also explains how to verify digital signatures. You just take the signature and unlock it using the signer's multiplicative key. If the resulting unlocked message matches the original message, the signature is authentic. Otherwise, it must have been forged. This verification process is shown in more detail in the table on the next page. In this table, we stick with a clock size of 11, but to show that there is nothing special about the numeric padlock and key we have been using up to this point, different values are used for these. Specifically, the padlock value is 9, and the corresponding key value is 5. In the table's first example, the message is “4” with signature “3.” The signature unlocks to “4,” which matches the original message so the signature is genuine. The next row of the table gives a similar example for the message “8” with signature “6.” But the final row shows what happens if the signature is forged. Here, the message is again “8” but the signature is “7.” This signature unlocks to “2,” which does not match the original message. Hence, the signature is forged.
如何检测伪造的数字签名。这些示例使用的挂锁值为 9,密钥值为 5。前两个签名是真实的,但第三个签名是伪造的。
How to detect a forged digital signature. These examples use a padlock value of 9 and a key value of 5. The first two signatures are genuine, but the third is forged.
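The three rows described for this table can be checked mechanically. This is a small sketch using the values given in the text (clock size 11, padlock 9, key 5); `is_genuine` is a hypothetical helper name.

```python
CLOCK_SIZE = 11
KEY = 5  # public key matching the padlock value 9


def is_genuine(message, signature, key=KEY, clock_size=CLOCK_SIZE):
    """Unlock the signature with the key and compare it to the message."""
    return (key * signature) % clock_size == message


if __name__ == "__main__":
    # (message, signature) pairs from the text: two genuine, one forged.
    for message, signature in [(4, 3), (8, 6), (8, 7)]:
        unlocked = (KEY * signature) % CLOCK_SIZE
        verdict = "genuine" if is_genuine(message, signature) else "forged"
        print(f"message {message}, signature {signature}, "
              f"unlocks to {unlocked}: {verdict}")
```

Note that the check needs only the key and the clock size. The padlock value 9 never appears in the verification code.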
如果你回想一下实体钥匙和挂锁的场景,你会记得挂锁上装有生物识别传感器,可以防止他人使用——否则伪造者可以利用拉维的挂锁将任何想要的信息锁进盒子里,从而伪造该信息的签名。同样的道理也适用于数字挂锁。拉维必须保密他的挂锁号码。每次他签署一条信息时,他可以同时显示信息和签名,但无法显示用于生成签名的挂锁号码。
If you think back to the physical key and padlock scenario, you will remember that the padlocks have biometric sensors preventing use by others—otherwise a forger could use one of Ravi's padlocks to lock any desired message into a box, thus forging a signature of that message. The same reasoning applies to numeric padlocks. Ravi must keep his padlock number secret. Each time he signs a message, he can reveal both the message and the signature, but not the padlock number used to produce the signature.
那么 Ravi 的时钟大小和数字密钥呢?这些也必须保密吗?答案是否定的。Ravi 可以向公众公布他的时钟大小和密钥值,比如发布在网站上,这样既不影响签名验证方案,又能保证万无一失。如果 Ravi 真的公布了他的时钟大小和密钥值,任何人都可以获得这些数字,从而验证他的签名。乍一看,这种方法确实非常方便,但其中有一些重要的细节需要解决。
What about Ravi's clock size and his numeric key? Must these also be kept secret? The answer is no. Ravi can announce his clock size and key value to the general public, perhaps by publishing them on a website, without compromising the scheme for verifying signatures. If Ravi does publish his clock size and key value, anyone can obtain these numbers and thus verify his signatures. This approach appears, at first glance, to be very convenient indeed—but there are some important subtleties to be addressed.
数字密钥库。该库的作用并非保密数字密钥和时钟大小。相反,它是一个值得信赖的机构,可以获取任何个人的真实密钥和时钟大小。任何索取这些信息的人,库都会免费披露。
A numeric key bank. The role of the bank is not to keep the numeric keys and clock sizes secret. Instead, the bank is a trusted authority for obtaining the true key and clock size associated with any individual. The bank freely reveals this information to anyone who asks for it.
例如,这种方法是否消除了对可信银行的需求?纸质签名技术和物理挂锁钥匙技术都需要可信银行。答案是否定的:仍然需要像银行这样的可信第三方。没有它,Ravi 可以分发错误的密钥值,使他的签名看起来无效。更糟糕的是,Ravi 的敌人可以创建新的数字挂锁和对应的数字密钥,建立一个网站宣布此密钥属于 Ravi,然后使用他们新创建的数字挂锁对任何他们想要的消息进行数字签名。任何相信新密钥属于 Ravi 的人都会认为敌人的消息是由 Ravi 签名的。因此,银行的作用不是保守 Ravi 的密钥和时钟大小的秘密。相反,银行是 Ravi 数字密钥和时钟大小值的可信权威机构。上图说明了这一点。
For example, does the approach eliminate the need for a trusted bank, which was required both for the paper signature technique and for the physical padlock-and-key technique? The answer is no: a trusted third party such as a bank is still required. Without it, Ravi could distribute a false key value that would make his signatures appear invalid. And, even worse, Ravi's enemies could create a new numeric padlock and corresponding numeric key, make a website announcing that this key is Ravi's, and then digitally sign any message they want using their newly minted numeric padlock. Anyone who believes that the new key belongs to Ravi will believe that the enemies' messages were signed by Ravi. Thus, the role of the bank is not to keep Ravi's key and clock size secret. Instead, the bank is a trusted authority for the value of Ravi's numeric key and clock size. The figure above demonstrates this.
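The role of the bank can be made concrete with a toy lookup table. In the sketch below, the dictionary `KEY_BANK`, the function `verify_with_bank`, and the enemy's padlock 9 with key 5 are all illustrative assumptions. Only Ravi's clock size 11, padlock 6, and key 2 come from the running example.

```python
# Trusted bank: maps each person to their public (clock size, key) pair.
KEY_BANK = {"Ravi": (11, 2)}

# An untrusted website run by Ravi's enemies, claiming key 5 is Ravi's.
FAKE_DIRECTORY = {"Ravi": (11, 5)}


def verify_with_bank(signer, message, signature, bank):
    """Look up the signer's public values, then unlock and compare."""
    clock_size, key = bank[signer]
    return (key * signature) % clock_size == message


if __name__ == "__main__":
    # The enemies sign message 7 with their own padlock, 9.
    forged_signature = (9 * 7) % 11
    print("fake directory accepts forgery:",
          verify_with_bank("Ravi", 7, forged_signature, FAKE_DIRECTORY))
    print("trusted bank accepts forgery:",
          verify_with_bank("Ravi", 7, forged_signature, KEY_BANK))
```

Anyone who consults the fake directory is fooled, while anyone who consults the trusted bank rejects the forgery. The arithmetic is identical in both cases; only the source of the key differs.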
总结此讨论的一个有效方法是:数字挂锁是私密的,而数字钥匙和时钟尺寸是公开的。诚然,将钥匙“公开”有点违反直觉,因为在日常生活中,我们习惯于小心翼翼地保管实体钥匙。为了解释这种不寻常的钥匙用途,请回想一下前面描述的实体挂锁技巧。当时,银行保留了拉维钥匙的副本,并乐意将其借给任何想要验证拉维签名的人。因此,从某种意义上说,实体钥匙是“公开的”。同样的道理也适用于乘法钥匙。
A useful way to summarize this discussion would be: numeric padlocks are private, whereas numeric keys and clock sizes are public. It is, admittedly, a little counterintuitive for a key to be “public,” because in our everyday lives we are used to guarding our physical keys very carefully. To clarify this unusual use of keys, think back to the physical padlock trick described earlier. There, the bank kept a copy of Ravi's key and would happily lend it to anyone wishing to verify Ravi's signature. So the physical key was, in some sense, “public.” The same reasoning applies to multiplicative keys.
现在正是时候讨论一个重要的实际问题:如果我们想要对一条长度超过一位数的消息进行签名该怎么办?这个问题有几种不同的答案。最初的解决方案是使用更大的时钟长度:例如,如果我们使用 100 位的时钟长度,那么完全相同的方法就可以让我们用 100 位的签名对 100 位的消息进行签名。对于超过 100 位长度的消息,我们可以将其分成 100 位的块,然后分别对每个块进行签名。但计算机科学家有更好的方法。事实证明,为了进行签名,长消息可以通过应用一种称为加密哈希函数的转换,缩减为单个块(例如 100 位)。我们之前在第 5 章中遇到过加密哈希函数,它们被用作校验和,以确保大型消息(例如软件包)的内容正确(参见第 73 页)。这里的想法非常类似:在进行签名之前,将长消息缩减为一个更小的块。这意味着,像软件包这样非常大的“消息”可以高效地进行签名。为了简单起见,本章的其余部分我们将忽略长消息的问题。
This is a good time to address an important practical issue: what if we want to sign a message longer than one digit? There are several different answers to this question. An initial solution is to use a much larger clock size: if we use a 100-digit clock size, for example, then exactly the same methods allow us to sign 100-digit messages with 100-digit signatures. For a message longer than this, we could just divide it into 100-digit chunks and sign each chunk separately. But computer scientists have a better way of doing this. It turns out that long messages can—for the purposes of signing—be reduced down into a single chunk (of, say, 100 digits), by applying a transformation called a cryptographic hash function. We've encountered cryptographic hash functions before, in chapter 5, where they were used as a checksum to ensure the content of a large message (such as a software package) was correct (see page 73). The idea here is very similar: a long message gets reduced to a much smaller chunk before signing takes place. This means that extremely large “messages,” such as software packages, can be signed efficiently. To keep things simple, we'll ignore the issue of long messages for the rest of the chapter.
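The idea of hashing first and signing second can be sketched as follows. Everything specific here is an assumption made for illustration: SHA-256 as the hash function, the prime 2**127 - 1 as the clock size, the padlock value, and the function names. It is a toy built on the multiplicative padlock of this section, not a secure signature scheme.

```python
import hashlib

CLOCK_SIZE = 2**127 - 1  # a known prime, chosen only for illustration
PADLOCK = 123456789      # hypothetical private padlock
KEY = pow(PADLOCK, -1, CLOCK_SIZE)  # matching key (needs Python 3.8+)


def digest_as_number(data, clock_size=CLOCK_SIZE):
    """Reduce a message of any length to one number below the clock size."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") % clock_size


def sign(data):
    """Lock the digest of the message rather than the message itself."""
    return (PADLOCK * digest_as_number(data)) % CLOCK_SIZE


def verify(data, signature):
    """Unlock the signature and compare it with a freshly computed digest."""
    return (KEY * signature) % CLOCK_SIZE == digest_as_number(data)


if __name__ == "__main__":
    package = b"a very large software package " * 10000
    signature = sign(package)
    print("original verifies:", verify(package, signature))
    print("tampered verifies:", verify(package + b"!", signature))
```

However long the input is, the signature is a single number below the clock size, which is why hashing makes signing large files efficient.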
另一个重要问题是:这些数字挂锁和钥匙最初是从哪里来的?前面提到过,参与者可以任意选择挂锁的值。可惜的是,这里“本质上”一词背后的细节需要数论本科课程才能理解。假设你还没有机会学习数论,请允许我提供以下提示:如果时钟尺寸是质数,那么任何小于时钟尺寸的正值都可以用作挂锁。否则,情况会更加复杂。质数是指除了 1 和它本身之外没有其他因数的数。因此,你可以看到,本章到目前为止使用的时钟尺寸 11 确实是质数。
Another important question is: where do these numeric padlocks and keys come from originally? It was mentioned earlier that participants can choose essentially any value for their padlock. The details hiding behind the word “essentially” here require an undergraduate course in number theory, unfortunately. But assuming you haven't had the chance to study number theory, allow me to provide the following teaser: if the clock size is a prime number, then any positive value less than the clock size will work as a padlock. Otherwise, the situation is more complicated. A prime number is a number that has no factors other than 1 and itself. So you can see that the clock size 11 used so far in this chapter is indeed prime.
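The teaser about prime clock sizes can be checked by brute force. The sketch below assumes that a padlock "works" when some key exists such that padlock times key leaves 1 on the clock; the function names are made up for illustration.

```python
def has_key(padlock, clock_size):
    """True if some key undoes multiplication by this padlock on the clock."""
    return any((padlock * key) % clock_size == 1 for key in range(1, clock_size))

def usable_padlocks(clock_size):
    """All positive values below clock_size that work as a multiplicative padlock."""
    return [p for p in range(1, clock_size) if has_key(p, clock_size)]

print(usable_padlocks(11))  # prime clock size: every value from 1 to 10 works
print(usable_padlocks(10))  # non-prime clock size: only some values work
```

With clock size 10, values such as 2 and 5 share a factor with the clock size and have no key, which is one glimpse of the "more complicated" situation.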
因此,选择挂锁很容易——尤其是当时钟大小为素数时。但是一旦选择了挂锁,我们仍然需要想出相应的数字钥匙来解锁所选的挂锁。这是一个有趣且非常古老的数学问题。实际上,这个问题的答案已经为人所知几个世纪了,其核心思想甚至更古老:它是一种称为欧几里得算法的技术,由希腊数学家欧几里得在 2000 多年前记录下来。但是,我们不需要在这里探究密钥生成的细节。只要知道,给定一个挂锁值,您的计算机就可以使用一种称为欧几里得算法的著名数学技术来得出相应的钥匙值就足够了。
Thus, choosing the padlock is the easy part—especially if the clock size is prime. But once the padlock is chosen, we still need to come up with the corresponding numeric key that unlocks the chosen padlock. This turns out to be an interesting—and very old—mathematical problem. Actually, the solution has been known for centuries, and the central idea is even older: it is a technique known as Euclid's algorithm, documented over 2000 years ago by the Greek mathematician Euclid. However, we don't need to pursue the details of key generation here. It is enough to know that, given a padlock value, your computer can come up with the corresponding key value using a well-known mathematical technique called Euclid's algorithm.
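For readers who do want a peek at key generation, here is a minimal sketch of the extended form of Euclid's algorithm. It assumes the key is the number that, multiplied by the padlock, leaves 1 on the clock. The padlock value 6 with clock size 11 is just an illustrative choice, and the function names are hypothetical.

```python
def extended_euclid(a, b):
    """Return (g, x, y) such that a*x + b*y == g, where g is gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_euclid(b, a % b)
    return g, y, x - (a // b) * y

def key_for_padlock(padlock, clock_size):
    """Find the multiplicative key matching a padlock on the given clock."""
    g, x, _ = extended_euclid(padlock, clock_size)
    if g != 1:
        raise ValueError("no key exists for this padlock and clock size")
    return x % clock_size

key = key_for_padlock(6, 11)
print(key, (6 * key) % 11)  # the second number is 1: the key undoes the padlock
```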
如果你仍然对这个解释不满意,也许你会更高兴,因为我即将揭示一个戏剧性的转折:用“乘法”方法解释挂锁和钥匙存在根本缺陷,必须摒弃。在下一节中,我们将使用一种不同的数值方法来解释挂锁和钥匙——这种方法在实践中确实得到了应用。那么,我为什么要费心解释这个有缺陷的乘法系统呢?主要原因是每个人都熟悉乘法,这意味着不需要一下子引入太多新概念就可以解释这个系统。另一个原因是,有缺陷的乘法方法与我们接下来要讨论的正确方法之间存在一些有趣的联系。
If you're still dissatisfied with this explanation, maybe you will be happier once I reveal a dramatic turn that we will be taking soon: the whole “multiplicative” approach to padlocks and keys has a fundamental flaw and must be abandoned. In the next section, we'll be using a different numerical approach to padlocks and keys—an approach that is actually used in practice. So why did I bother to explain the flawed multiplicative system? The main reason is that everyone is familiar with multiplication, which means that the system could be explained without requiring too many new ideas all at once. Another reason is that there are some fascinating connections between the flawed multiplicative approach and the correct approach we will consider next.
但在继续之前,让我们先来理解乘法方法的缺陷。回想一下,挂锁值是私有的(即秘密的),而密钥值是公开的。正如刚才讨论的,签名方案的参与者可以自由选择时钟大小(公开)和挂锁值(保持私有),然后使用计算机生成相应的密钥值(通过欧几里得算法,这是我们目前为止一直在使用的乘法密钥的具体情况)。密钥存储在一家值得信赖的银行中,银行会向任何询问的人透露密钥的值。乘法密钥的问题在于,用于从挂锁生成钥匙的技巧——本质上是欧几里得算法——反过来也完全有效:完全相同的技术可以让计算机生成与给定密钥值对应的挂锁值!我们立刻就能明白为什么这会破坏整个数字签名方案。由于密钥值是公开的,任何人都可以计算出所谓的秘密挂锁值。一旦您知道某人的挂锁值,您就可以伪造该人的数字签名。
But before moving on, let's try to understand the flaw in the multiplicative approach. Recall that padlock values are private (i.e., secret), whereas key values are public. As just discussed, a participant in a signature scheme freely chooses a clock size (which is made public) and a padlock value (which remains private), and then generates the corresponding key value using a computer (via Euclid's algorithm, in the particular case of the multiplicative keys we have been using so far). The key is stored in a trustworthy bank, and the bank reveals the key's value to anyone who asks. The problem with a multiplicative key is that the same trick—essentially Euclid's algorithm—that is used to generate a key from a padlock works perfectly well in reverse: exactly the same technique allows a computer to generate the padlock value corresponding to a given key value! We can immediately see why this trashes the whole digital signature scheme. Because the key values are public, the supposedly secret padlock values can be computed by anyone. And once you know someone's padlock value, you can forge that person's digital signature.
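The flaw is easy to demonstrate. In this sketch the public values (clock size 11, key 2) are illustrative assumptions, and Python's built-in `pow` with exponent -1 (available from Python 3.8) is used as a shortcut that yields the same result as Euclid's algorithm.

```python
def padlock_from_public_key(key, clock_size):
    """Recover the 'secret' multiplicative padlock from public information only."""
    # pow(key, -1, clock_size) is the modular inverse of key (Python 3.8+).
    return pow(key, -1, clock_size)

def forge_signature(message, key, clock_size):
    """Sign a message without ever being told the padlock."""
    padlock = padlock_from_public_key(key, clock_size)
    return (message * padlock) % clock_size

PUBLIC_CLOCK_SIZE, PUBLIC_KEY = 11, 2          # illustrative public values
forged = forge_signature(5, PUBLIC_KEY, PUBLIC_CLOCK_SIZE)
# Anyone checking the forgery unlocks it with the public key and gets the message back.
print(forged, (forged * PUBLIC_KEY) % PUBLIC_CLOCK_SIZE)
```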
使用指数挂锁签名
SIGNING WITH AN EXPONENT PADLOCK
在本节中,我们将把有缺陷的乘法系统升级为一个实际应用的数字签名方案,即 RSA。但新系统将使用一种不太常见的运算——幂运算来代替乘法。事实上,我们在第四章构建公钥密码学的理解时,也经历了同样的解释步骤:首先研究了一个简单但有缺陷的使用乘法的系统,然后研究了使用幂运算的真实版本。
In this section, we will upgrade our flawed multiplicative system to a digital signature scheme, known as RSA, that is actually used in practice. But the new system will use a less-familiar operation called exponentiation in place of the multiplication operation. In fact, we went through the same sequence of explanatory steps when building up our understanding of public key cryptography in chapter 4: we first worked through a simple but flawed system that used multiplication, and then looked at the real version using exponentiation.
因此,如果你对幂的表示法不太熟悉,比如 5 9和 3 4,那么现在是回到第 52 页复习一下的好时机。但需要提醒的是,3 4(“3 的 4 次方”)表示 3×3x3×3。此外,我们还需要一些技术术语。在像 3 4这样的表达式中, 4 称为指数或幂, 3 称为底数。将指数应用于底数的过程称为“提升到幂”,或者更正式地称为指数运算。与第 4 章一样,我们将把指数运算与时钟运算结合起来。本章本节中的所有示例都将使用时钟大小 22。我们需要的唯一指数是 3 和 7,因此我在上面提供了一个表格,显示了 n 3和n 7的值,对于n的每个值,最大为 20(当时钟大小为 22 时)。
So, if you're not too familiar with power notation, like 5⁹ and 3⁴, this would be a great time to go back to page 52 for a refresher. But as a one-line reminder, 3⁴ (“3 to the power of 4”) means 3 × 3 × 3 × 3. In addition, we need a few more technical terms. In an expression like 3⁴, the 4 is called the exponent or power and the 3 is called the base. The process of applying an exponent to a base is called “raising to a power,” or, more formally, exponentiation. As in chapter 4, we will be combining exponentiation with clock arithmetic. All the examples in this section of the chapter will use clock size 22. The only exponents we will need are 3 and 7, so I've provided a table above showing the value of n³ and n⁷, for every value of n up to 20 (when the clock size is 22).
当时钟大小为 22 时,以 3 和 7 为幂的值。
Values for exponentiating by 3 and 7 when the clock size is 22.
现在让我们检查一下此表中的几个条目,以确保它们有意义。看一下与n = 4 对应的行。如果我们不使用时钟算术,那么我们可以算出 4 3 = 4 × 4 × 4 = 64。但是应用 22 的时钟大小,我们发现 22 能除以 64 的两倍(得到 44),剩下 20。这就解释了 n 3列中的 20 条目。同样,如果不使用时钟算术,您可以算出 4 7 = 16, 384(好吧,您可以相信我),恰好比最接近的 22 的倍数大 16(即 22 × 744 = 16, 368,以防您感兴趣)。所以这解释了 n 7列中的 16 。
Let's check a couple of the entries in this table now, to ensure they make sense. Take a look at the row corresponding to n = 4. If we weren't using clock arithmetic, then we could work out that 4³ = 4 × 4 × 4 = 64. But applying the clock size of 22, we see that 22 goes into 64 twice (giving 44), with 20 left over. That explains the entry of 20 in the column for n³. Similarly, without clock arithmetic you can work out that 4⁷ = 16,384 (okay, you can trust me on that one), which happens to be 16 more than the nearest multiple of 22 (that's 22 × 744 = 16,368, just in case you are interested). So that explains the 16 in the column for n⁷.
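The table entries can be regenerated with a short script. This is a sketch that assumes the table runs over n from 1 to 20; the three-argument form of Python's built-in `pow` performs exponentiation and applies the clock size in one step.

```python
CLOCK_SIZE = 22

def power_table(clock_size=CLOCK_SIZE, upto=20):
    """Map each n to the pair (n cubed, n to the 7th), both reduced by the clock size."""
    return {n: (pow(n, 3, clock_size), pow(n, 7, clock_size))
            for n in range(1, upto + 1)}

table = power_table()
print(" n  n^3  n^7")
for n, (cube, seventh) in table.items():
    print(f"{n:>2}  {cube:>3}  {seventh:>3}")
```

The row for n = 4 comes out as 20 and 16, matching the hand calculation above.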
现在我们终于可以看看真正的数字签名了。该系统的工作原理与上一节中的乘法方法完全相同,但有一点不同:我们不是使用乘法来锁定和解锁消息,而是使用指数运算。和之前一样,Ravi 首先选择一个将要公开的时钟大小。在这里,Ravi 使用的时钟大小是 22。然后,他选择一个秘密的挂锁值,可以是任何小于时钟大小的值(具体细节我们将在稍后简要讨论)。在我们的示例中,Ravi 选择 3 作为挂锁值。然后,他使用计算机根据给定的挂锁和时钟尺寸,计算出对应的钥匙值。稍后我们会学习更多相关细节。但唯一重要的一点是,计算机可以使用一种众所周知的数学方法,轻松地根据挂锁和时钟尺寸计算出钥匙。在本例中,钥匙值 7 对应于之前选择的挂锁值 3。
Now we are finally ready to see a genuine digital signature in action. The system works exactly the same as the multiplicative method from the previous section, with one exception: instead of locking and unlocking messages using multiplication, we use exponentiation. As before, Ravi first chooses a clock size that will be made public. Here, Ravi uses clock size 22. Then he selects a secret padlock value, which can be anything less than the clock size (subject to some fine print that we'll discuss briefly later). In our example, Ravi chooses 3 as his padlock value. Then, he uses a computer to work out the corresponding key value for the given padlock and clock size. We'll learn a few more details about this later on. But the only important fact is that a computer can easily compute the key from the padlock and clock size, using a well-known mathematical technique. In this case, it turns out that the key value 7 corresponds to the previously selected padlock value 3.
使用指数运算锁定和解锁消息。
Locking and unlocking messages using exponentiation.
上图展示了一些具体示例,展示了 Ravi 如何对消息进行签名,以及其他人如何解锁签名进行验证。如果消息为“4”,则签名为“20”:我们以挂锁为指数对消息求幂得到这个值。因此,我们需要计算 4 3,将时钟大小考虑进去后结果为 20。(别忘了,您可以使用上一页的表格轻松检查这些计算。)现在,当 Francoise 想要验证 Ravi 的数字签名“20”时,她首先去银行获取 Ravi 时钟大小和密钥的权威值。(银行看起来与之前相同,只是数字不同——参见第 161 页的图。)然后,Francoise 取出签名,以密钥值求幂,并应用时钟大小:这得到 20 7 = 4,同样使用上一页的表格。如果结果与原始消息匹配(在本例中确实匹配),则签名是真实的。图中显示了消息“8”和“7”的类似计算。
The figure above shows some concrete examples of how Ravi can sign messages, and how others can unlock the signatures to check them. If the message is “4,” the signature is “20”: we get this by exponentiating the message with the padlock as exponent. So we need to compute 4³, which gives 20 once the clock size is taken into account. (Don't forget, you can easily check any of these computations using the table on the previous page.) Now, when Francoise wants to verify Ravi's digital signature “20,” she first goes to the bank to get authoritative values for Ravi's clock size and key. (The bank looks the same as before, except with different numbers; see the figure on page 161.) Then Francoise takes the signature, exponentiates by the key value, and applies the clock size: this gives 20⁷ = 4, again using the table on the previous page. If the result matches the original message (and in this case it does), the signature is authentic. The figure shows similar calculations for the messages “8” and “7.”
下一页的表格再次展示了该过程,这次重点强调了签名的验证。图中的前两个示例与上图相同(分别为消息“4”和“8”),并且它们具有真实的签名。第三个示例包含消息“8”和签名“9”。通过应用密钥和时钟大小进行解锁,结果为 9 7 = 15,与原始消息不匹配。因此,该签名是伪造的。
The table on the following page shows the process again, this time emphasizing the verification of the signature. The first two examples in this table are identical to those in the previous figure (messages “4” and “8,” respectively), and they have genuine signatures. The third example has message “8” and signature “9.” Unlocking, by applying the key and clock size, gives 9⁷ = 15, which doesn't match the original message. Therefore, this signature is forged.
如何利用指数运算检测伪造的数字签名。这些示例使用的挂锁值为 3,密钥值为 7,时钟大小为 22。前两个签名是真实的,但第三个签名是伪造的。
How to detect a forged digital signature with exponentiation. These examples use a padlock value of 3, a key value of 7, and a clock size of 22. The first two signatures are genuine, but the third is forged.
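The signing and verification steps just described can be written out directly. This sketch uses the chapter's toy values (padlock 3, key 7, clock size 22); the function names are chosen for illustration only.

```python
CLOCK_SIZE = 22   # public
PADLOCK = 3       # private: known only to Ravi
KEY = 7           # public: held by the bank

def sign(message):
    """Lock the message: raise it to the padlock power, then apply the clock size."""
    return pow(message, PADLOCK, CLOCK_SIZE)

def is_genuine(message, signature):
    """Unlock the signature with the public key and compare with the message."""
    return pow(signature, KEY, CLOCK_SIZE) == message

print(sign(4), sign(8), sign(7))   # signatures for the three example messages
print(is_genuine(4, 20))           # genuine signature
print(is_genuine(8, 9))            # forged: unlocking 9 gives 15, not 8
```

Note that a valid signature copied onto a different message fails the same check, because unlocking always recovers the message that was originally signed.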
如前所述,这种指数挂锁和指数密钥的方案称为RSA数字签名方案,以其发明者(Ronald Rivest、Adi S hamir 和 Leonard Adleman)的名字命名,他们在20世纪70 年代首次发布了该系统。这听起来可能出奇地熟悉,因为我们已经在第 4 章公钥密码学中遇到过首字母缩略词 RSA。事实上,RSA 既是一种公钥密码方案,又是一种数字签名方案——这并非巧合,因为这两类算法之间存在着深厚的理论联系。在本章中,我们仅探讨了 RSA 的数字签名方面,但您可能已经注意到它与第 4 章中的概念有着惊人的相似之处。
As mentioned earlier, this scheme of exponent padlocks and exponent keys is known as the RSA digital signature scheme, named for its inventors (Ronald Rivest, Adi Shamir, and Leonard Adleman), who first published the system in the 1970s. This may sound eerily familiar, because we already encountered the acronym RSA in chapter 4, on public key cryptography. In fact, RSA is both a public key cryptography scheme and a digital signature scheme—which is no coincidence, as there is a deep theoretical relationship between these two types of algorithms. In this chapter, we have explored only the digital signature aspect of RSA, but you may have noticed some striking similarities to the ideas in chapter 4.
RSA 系统中如何选择时钟大小、挂锁和密钥的细节确实引人入胜,但这些细节并非理解整体方法的必要条件。最重要的一点是,在这个系统中,一旦选择了挂锁值,参与者就可以轻松计算出合适的密钥值。但其他人无法逆转这个过程:即使你知道其他人使用的密钥和时钟大小,也无法计算出相应的挂锁值。这修复了之前解释过的乘法系统中的缺陷。
The details of how to choose clock sizes, padlocks, and keys in the RSA system are truly fascinating, but they aren't needed to understand the overall approach. The most important point is that in this system, a participant can easily compute an appropriate key value once the padlock value has been chosen. But it is impossible for anyone else to reverse the process: if you know the key and clock size being used by someone else, you can't work out the corresponding padlock value. This fixes the flaw in the multiplicative system explained earlier.
至少,计算机科学家认为是这样,但没有人确切知道。RSA 是否真正安全是整个计算机科学中最令人着迷和最棘手的问题之一。首先,这个问题取决于一个古老的未解数学问题和物理学与计算机科学研究交叉领域的一个较新的热门话题。这个数学问题被称为整数分解;而热门研究课题是量子计算。我们将依次探讨 RSA 安全性的这两个方面,但在此之前,我们需要更好地理解像 RSA 这样的数字签名方案的“安全”的真正含义。
At least, computer scientists think it does, but nobody knows for sure. The issue of whether RSA is truly secure is among the most fascinating (and vexing) questions in the whole of computer science. For one thing, this question depends on both an ancient unsolved mathematical problem and a much more recent hot topic at the intersection of physics and computer science research. The mathematical problem is known as integer factorization; the hot research topic is quantum computing. We are going to explore both of these aspects of RSA security in turn, but before we do that, we need a slightly better understanding of what it really means for a digital signature scheme like RSA to be “secure.”
RSA 的安全性
The Security of RSA
任何数字签名方案的安全性都归结为一个问题:“我的敌人能伪造我的签名吗?” 对于 RSA 来说,这反过来又可以归结为“根据我的公共时钟大小和密钥值,我的敌人能否计算出我的私人挂锁值?” 你可能会苦恼地发现,这个问题的答案很简单,就是“能!” 事实上,你已经知道了:通过反复试验,总是可以推算出某人的挂锁值。毕竟,我们得到了一条消息、一个时钟大小和一个数字签名。我们知道挂锁值小于时钟大小,所以我们可以简单地依次尝试所有可能的挂锁值,直到找到一个能生成正确签名的挂锁值。这只需要用每个尝试的挂锁值对消息进行幂运算即可。问题在于,在实践中,RSA 方案使用的时钟大小非常大——比如说,数千位数字。因此,即使是现有最快的超级计算机,也需要数万亿年才能尝试所有可能的挂锁值。因此,我们并不关心敌人是否能够以任何方式计算出挂锁的值。相反,我们想知道敌人是否能够高效地计算出挂锁的值,从而构成实际威胁。如果敌人最擅长的攻击方法是反复试验(计算机科学家称之为暴力破解),我们总是可以选择足够大的时钟大小,使这种攻击变得不可行。另一方面,如果敌人拥有比暴力破解速度快得多的技术,我们可能会遇到麻烦。
The security of any digital signature scheme comes down to the question, “Can my enemy forge my signature?” For RSA, this in turn boils down to “Can my enemy compute my private padlock value, given my public clock size and key value?” You might be distressed to learn that the simple answer to this question is “Yes!” In fact, you already knew that: it's always possible to work out someone's padlock value by trial and error. After all, we are given a message, a clock size, and a digital signature. We know the padlock value is smaller than the clock size, so we can simply try every possible padlock value in turn, until we find one that produces the correct signature. It's just a matter of exponentiating the message by each trial padlock value. The catch is that, in practice, RSA schemes use absolutely enormous clock sizes—say, thousands of digits long. So even on the fastest existing supercomputer, it would take trillions of years to try all the possible padlock values. Therefore, we are not interested in whether an enemy could compute the padlock value by any means whatsoever. Instead, we want to know if the enemy can do so efficiently enough to be a practical threat. If the enemy's best method of attack is trial and error—also known as brute force by computer scientists—we can always choose our clock size large enough to make the attack impractical. If, on the other hand, the enemy has a technique that works significantly faster than brute force, we might be in trouble.
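The trial-and-error attack is only a few lines long. The sketch below assumes the attacker has seen one message with its signature and knows the public clock size; the function name is hypothetical. With a single message, more than one trial value may happen to fit, so a careful attacker would confirm the candidate against further signed messages.

```python
def brute_force_padlock(message, signature, clock_size):
    """Try every possible padlock value until one reproduces the signature."""
    for trial in range(1, clock_size):
        if pow(message, trial, clock_size) == signature:
            return trial
    return None  # no trial value fits

# Toy values from this chapter: message 4 carries signature 20 on clock size 22.
print(brute_force_padlock(4, 20, 22))
```

The loop runs at most clock_size times, which is trivial for 22 but hopeless for a clock size hundreds or thousands of digits long.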
例如,回顾乘法挂锁和密钥方案,我们了解到签名者可以选择一个挂锁值,然后使用欧几里得算法据此计算出密钥值。但其缺陷在于,攻击者无需诉诸暴力破解即可逆转此过程:事实证明,欧几里得算法也可以用来计算给定钥匙的挂锁值,而且该算法比暴力破解效率高得多。这就是为什么乘法方法被认为是不安全的。
For example, going back to the multiplicative padlock and key scheme, we learned that a signer can choose a padlock value and then compute the key value from this using Euclid's algorithm. But the flaw was that adversaries did not need to resort to brute force to reverse this process: it turned out that Euclid's algorithm could also be used to compute the padlock given the key, and this algorithm is vastly more efficient than brute force. That's why the multiplicative approach is considered insecure.
RSA 与因式分解之间的联系
The Connection between RSA and Factoring
我之前承诺要揭示 RSA 的安全性与一个古老的数学问题——整数分解——之间的联系。为了理解这种联系,我们需要了解更多关于如何选择 RSA 时钟大小的细节。
I promised earlier to reveal a connection between the security of RSA and an age-old mathematical problem called integer factorization. To understand this connection, we need a few more details about how an RSA clock size is chosen.
首先,回顾一下质数的定义:除了 1 和它本身之外,没有其他因数的数。例如,31 是质数,因为 1 × 31 是两个数相乘得到 31 的唯一方法。但 33 不是质数,因为 33 = 3 × 11。
First, recall the definition of a prime number: it is a number that has no factors other than 1 and itself. For example, 31 is prime because 1 × 31 is the only way to produce 31 as the product of two numbers. But 33 is not prime, since 33 = 3 × 11.
现在我们准备演示一下像我们的老朋友 Ravi 这样的签名者如何为 RSA 生成一个时钟大小。Ravi 首先要做的是选择两个非常大的素数。通常这些数字会有几百位,但像往常一样,我们先用一个小例子来说明。假设 Ravi 选择 2 和 11 作为素数。然后他将它们相乘;这就得到了时钟大小。在我们的例子中,时钟大小是 2 × 11 = 22。众所周知,时钟大小将与 Ravi 选择的密钥值一起公开。但是——这是关键所在——时钟大小的两个素因数仍然是秘密,只有 Ravi 自己知道。RSA 背后的数学原理为 Ravi 提供了一种方法,可以使用这两个素因数通过密钥值计算出挂锁值,反之亦然。
Now we are ready to walk through how a signer such as our old friend Ravi can generate a clock size for RSA. The first thing Ravi does is choose two very large prime numbers. Typically these numbers will be hundreds of digits long, but as usual we will work with a tiny example instead. So let's say Ravi chooses 2 and 11 as his prime numbers. Then he multiplies them together; this produces the clock size. So in our example, the clock size is 2 × 11 = 22. As we know, the clock size will be made public along with Ravi's chosen key value. But—and this is the crucial point—the two prime factors of the clock size remain secret, known only to Ravi. The math behind RSA gives Ravi a method of using these two prime factors to compute a padlock value from a key value and vice versa.
该方法的细节将在下一页的面板中描述,但与我们的主要目的无关。我们需要意识到的是,Ravi 的敌人无法利用公开信息(时钟大小和密钥值)计算出他的秘密挂锁值。但如果他的敌人也知道时钟大小的两个素因数,他们就能轻松计算出秘密挂锁值。换句话说,只要 Ravi 的敌人能够分解时钟大小,他们就能伪造他的签名。(当然,也许还有其他破解 RSA 的方法。高效分解时钟大小只是其中一种可能的攻击方法。)
The details of this method are described in the panel on the next page, but they are irrelevant for our main purpose. All we need to realize is that Ravi's enemies cannot compute his secret padlock value using the publicly available information (the clock size and key value). But if his enemies also knew the two prime factors of the clock size, they could easily compute the secret padlock value. In other words, Ravi's enemies can forge his signature if they can factorize the clock size. (Of course, there may be other ways of cracking RSA. Efficient factorization of the clock size is just one possible method of attack.)
在我们这个小例子中,分解时钟大小(从而破解数字签名方案)极其简单:每个人都知道 22 = 2 × 11。但当时钟大小达到数百甚至数千位时,寻找因数就变得极其困难。事实上,尽管这个所谓的“整数分解”问题已经被研究了几个世纪,但还没有人找到一个通用的、足够有效的方法来破解典型的 RSA 时钟大小。
In our small example, factoring the clock size (and thus cracking the digital signature scheme) is absurdly easy: everyone knows that 22 = 2 × 11. But when the clock size is hundreds or thousands of digits long, finding the factors turns out to be an extremely difficult problem. In fact, although this so-called “integer factorization” problem has been studied for centuries, no one has found a general method of solving it that works efficiently enough to compromise a typical RSA clock size.
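To see why small clock sizes are no obstacle, consider the simplest possible factoring method, trial division. This is a naive sketch rather than a method anyone would use against real RSA values, and the function name is made up.

```python
def factorize_clock_size(clock_size):
    """Return a pair of factors (smallest first), or None if clock_size is prime."""
    divisor = 2
    while divisor * divisor <= clock_size:
        if clock_size % divisor == 0:
            return divisor, clock_size // divisor
        divisor += 1
    return None

print(factorize_clock_size(22))
```

The loop may need roughly as many steps as the square root of the clock size. For a number with hundreds of digits, that is far beyond what any computer can do, and the cleverer known methods are still much too slow.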
数学史上充斥着许多未解之谜,它们仅凭其美学价值就让数学家们着迷,尽管缺乏实际应用,却也激发了人们的深入研究。令人惊讶的是,许多这些看似有趣却看似无用的问题后来却被证明具有重大的实际意义——在某些情况下,这种意义是在问题被研究了几个世纪之后才被发现的。
The history of mathematics is peppered with unsolved problems that fascinated mathematicians by their aesthetic qualities alone, inspiring deep investigation despite the lack of any practical application. Rather astonishingly, many of these intriguing-yet-apparently-useless problems later turned out to have great practical significance—and in some cases, the significance was discovered only after the problem had been studied for centuries.
Ravi 选择两个素数(在我们这个简单的例子中是 2 和 11),并将它们相乘,得到他的时钟尺寸(22)。我们将其称为“主”时钟尺寸,其原因很快就会揭晓。接下来,Ravi 从原来的两个素数中各减一,然后将这两个数相乘。这就得到了 Ravi 的“次”时钟尺寸。在我们的例子中,Ravi 从原来的两个素数中各减一后,剩下的是 1 和 10,因此次时钟尺寸为 1 × 10 = 10。
Ravi chooses two prime numbers (2 and 11, in our simple example) and multiplies them together to produce his clock size (22). Let's refer to this as the “primary” clock size for reasons that will soon become apparent. Next, Ravi subtracts one from each of the original two prime numbers, and multiplies those numbers together. This produces Ravi's “secondary” clock size. In our example, Ravi is left with 1 and 10 after subtracting one from each of the original primes, so the secondary clock size is 1 × 10 = 10.
此时,我们找到了一个与之前描述的有缺陷的乘法锁钥匙系统极其契合的联系:拉维根据乘法系统选择挂锁和钥匙,但使用的是辅助时钟尺寸而不是主时钟尺寸。假设拉维选择 3 作为他的挂锁号码。事实证明,当使用辅助时钟尺寸 10 时,对应的乘法钥匙是 7。我们可以快速验证这个方法是否有效:消息“8”的锁数字为 8 × 3 = 24,或者在 10 号时钟尺寸下为“4”。用钥匙解锁“4”得到 4 × 7 = 28,也就是应用时钟尺寸后的数字“8”——与原始消息相同。
At this point, we encounter an extremely gratifying connection to the flawed multiplicative padlock-and-key system described earlier: Ravi chooses a padlock and key according to the multiplicative system, but using the secondary clock size instead of the primary. Suppose Ravi chooses 3 as his padlock number. It turns out, when using the secondary clock size of 10, that the corresponding multiplicative key is 7. We can quickly verify that this works: the message “8” padlocks to 8 × 3 = 24, or “4” in clock size 10. Unlocking “4” with the key gives 4 × 7 = 28, which is “8” after applying the clock size—the same as the original message.
Ravi 的工作现在完成了:他把刚刚选定的乘法锁和钥匙直接用作RSA 系统中的指数锁和钥匙。当然,它们将用作主时钟大小 22 的指数。
Ravi's work is now done: he takes the multiplicative padlock and key just chosen, and uses them directly as his exponent padlock and key in the RSA system. Of course, they will be used as exponents with the primary clock size, 22.
生成 RSA 时钟、挂锁和密钥值的详细信息。
The gory details of generating RSA clock, padlock, and key values.
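The panel's recipe translates almost line for line into code. This is a sketch using the toy primes 2 and 11; the function name is hypothetical, and Python's `pow(padlock, -1, secondary)` (Python 3.8+) stands in for the Euclid-style key calculation.

```python
def generate_rsa_values(prime_a, prime_b, padlock):
    """Follow the panel: primary clock, secondary clock, then the matching key."""
    primary = prime_a * prime_b                # public clock size
    secondary = (prime_a - 1) * (prime_b - 1)  # kept secret along with the primes
    # Multiplicative key for the padlock on the SECONDARY clock;
    # raises ValueError if this padlock has no key there.
    key = pow(padlock, -1, secondary)
    return primary, secondary, key

primary, secondary, key = generate_rsa_values(2, 11, padlock=3)
print(primary, secondary, key)

# Used as exponents on the primary clock, the padlock and key undo each other.
print(all(pow(pow(m, 3, primary), key, primary) == m for m in range(primary)))
```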
整数因式分解就是这样一个问题。最早的严肃研究似乎出现在17世纪,由数学家费马和梅森发起。欧拉和高斯——数学史上两位最伟大的数学家——在随后的几个世纪里做出了贡献,后来许多人也在他们的工作基础上继续发展。但直到20世纪70年代公钥密码学的出现,大数因式分解的难度才成为实际应用的关键。正如你现在所知,任何发现高效大数因式分解算法的人,都能随意伪造数字签名!
Integer factorization is just such a problem. The earliest serious investigations seem to have been in the 17th century, by the mathematicians Fermat and Mersenne. Euler and Gauss—two of the biggest names in the mathematical canon—made contributions in the centuries immediately following, and many others have built on their work. But it was not until the discovery of public key cryptography in the 1970s that the difficulty of factoring large numbers became the linchpin of a practical application. As you now know, anyone who discovers an efficient algorithm for factoring large numbers will be able to forge digital signatures at will!
在听起来过于令人担忧之前,我应该澄清一下,自 20 世纪 70 年代以来,已经发明了许多其他数字签名方案。虽然每个方案都依赖于某些基本数学挑战的难度,但不同的方案依赖于不同的数学挑战。因此,一种有效的因式分解算法的发现只能破解类似 RSA 的方案。
Before this begins to sound too alarming, I should clarify that numerous other digital signature schemes have been invented since the 1970s. Although each scheme depends on the difficulty of some fundamental mathematical challenge, the different schemes rely on different mathematical challenges. Therefore, the discovery of an efficient factorization algorithm will break only the RSA-like schemes.
另一方面,计算机科学家们仍然对所有这些系统都存在的一个令人费解的陷阱感到困惑:所有方案都未被证明是安全的。它们都依赖于一些看似困难、经过深入研究的数学难题。然而,理论家们始终无法证明每种方案都不存在有效的解决方案。因此,尽管专家们认为这种可能性极小,但原则上,这些密码学或数字签名方案中的任何一个都有可能随时被彻底破解。
On the other hand, computer scientists continue to be baffled by an intriguing gotcha that applies to all of these systems: none of the schemes has been proved secure. Each of them depends on some apparently difficult, much-studied mathematical challenge. But in each case, theoreticians have been unable to prove that no efficient solution exists. Thus, although experts consider it extremely unlikely, it is possible in principle that any one of these cryptography or digital signature schemes could be cracked wide open at any time.
RSA 与量子计算机之间的联系
The Connection between RSA and Quantum Computers
我兑现了揭示RSA与一个古老数学问题之间联系的承诺,但我尚未解释它与量子计算这一热门研究课题之间的联系。要阐明这一点,我们首先必须接受以下基本事实:在量子力学中,物体的运动受概率支配——这与经典物理学的确定性定律相反。因此,如果你用易受量子力学效应影响的部件构建一台计算机,它计算的值将由概率决定,而不是像经典计算机那样产生绝对确定的0和1序列。换言之,量子计算机可以同时存储许多不同的值:不同的值具有不同的概率,但在你强制计算机输出最终答案之前,这些值都是同时存在的。这使得量子计算机可以同时计算出许多不同的可能答案。因此,对于某些特殊类型的问题,你可以使用“暴力破解”的方法,同时尝试所有可能的解法!
I've made good on my promise to reveal a connection between RSA and an old mathematical problem, but I have yet to explain the connection to the hot research topic of quantum computing. To pursue this, we must first accept the following fundamental fact: in quantum mechanics, the motion of objects is governed by probabilities—in contrast to the deterministic laws of classical physics. So if you build a computer out of parts that are susceptible to quantum-mechanical effects, the values it computes are determined by probabilities, instead of the absolutely certain sequence of 0s and 1s that a classical computer produces. Another way of viewing this is that a quantum computer stores many different values at the same time: the different values have different probabilities, but until you force the computer to output a final answer, the values all exist simultaneously. This leads to the possibility that a quantum computer can compute many different possible answers at the same time. So for certain special types of problems, you can use a “brute force” approach that tries all of the possible solutions simultaneously!
这仅适用于某些类型的问题,但整数分解恰好是量子计算机能够比传统计算机更高效地执行的任务之一。因此,如果你能构建一台能够处理数千位数字的量子计算机,你就能像前面解释的那样伪造 RSA 签名:分解公共时钟大小,使用这些因子确定辅助时钟大小,并由此根据公钥值确定私钥挂锁值。
This does only work for certain types of problems, but it just so happens that integer factorization is one of the tasks that can be performed with vastly greater efficiency on quantum computers than on classical ones. Therefore, if you could build a quantum computer that could handle numbers with thousands of digits, you could forge RSA signatures as explained earlier: factorize the public clock size, use the factors to determine the secondary clock size, and use this to determine the private padlock value from the public key value.
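The three attack steps listed above can be sketched as follows. An important caveat: ordinary trial division is used here as a stand-in for the factoring step, since that is the step a large quantum computer would supposedly make fast. The function names are illustrative.

```python
def factorize(clock_size):
    """Stand-in for the hard step: find two factors by trial division."""
    divisor = 2
    while divisor * divisor <= clock_size:
        if clock_size % divisor == 0:
            return divisor, clock_size // divisor
        divisor += 1
    raise ValueError("clock size is not a product of two factors")

def recover_padlock(public_clock_size, public_key):
    """Factorize, build the secondary clock size, then invert the public key."""
    prime_a, prime_b = factorize(public_clock_size)
    secondary = (prime_a - 1) * (prime_b - 1)
    return pow(public_key, -1, secondary)  # modular inverse (Python 3.8+)

padlock = recover_padlock(22, 7)
print(padlock)                 # Ravi's supposedly secret padlock
print(pow(8, padlock, 22))     # a forged signature for the message 8
```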
2011年,当我写下这些文字时,量子计算的理论发展远超其实践。研究人员已经成功构建出真正的量子计算机,但迄今为止,量子计算机所能执行的最大因数分解是15 = 3 × 5——这远非分解千位RSA时钟大小的因数分解所能比拟!而且,在制造出更大规模的量子计算机之前,还有许多艰巨的实际问题需要解决。因此,没有人知道量子计算机何时,或者是否能够发展到足以彻底破解RSA系统。
As I write these words in 2011, the theory of quantum computing is far ahead of its practice. Researchers have managed to build real quantum computers, but the biggest factorization performed by a quantum computer so far is 15 = 3 × 5—a far cry indeed from factoring a thousand-digit RSA clock size! And there are formidable practical problems to be solved before larger quantum computers can be created. So no one knows when, or if, quantum computers will become large enough to break the RSA system once and for all.
数字签名实践
DIGITAL SIGNATURES IN PRACTICE
在本章开头,我们了解到像你我这样的终端用户并没有太多必要对内容进行数字签名。一些精通计算机的用户确实会对电子邮件等内容进行签名,但对我们大多数人来说,数字签名的主要用途是验证下载的内容。最明显的例子就是下载一个新软件。如果软件已签名,你的计算机会使用签名者的公钥“解锁”签名,并将结果与签名者的“消息”(在本例中是软件本身)进行比较。(如前所述,在实际操作中,软件在签名之前会被压缩成一条更小的消息,称为安全哈希值。)如果解锁的签名与软件匹配,你会收到一条令人鼓舞的消息,否则,你会看到一条更可怕的警告:第 150 页的图中展示了这两种情况的示例。
Early on in this chapter, we learned that end-users like you and me don't have much need to sign things digitally. Some computer-savvy users do sign things like e-mail messages, but for most of us, the primary use of digital signatures is the verification of downloaded content. The most obvious example of this is when you download a new piece of software. If the software is signed, your computer “unlocks” the signature using the signer's public key and compares the result with the signer's “message,” which in this case is the software itself. (As mentioned earlier, in practice the software is reduced to a much smaller message called a secure hash before it is signed.) If the unlocked signature matches the software, you get an encouraging message; otherwise, you see a more dire warning. Examples of both were shown in the figure on page 150.
正如我们一直强调的那样,我们所有的方案都需要某种可信的“银行”来存储签名者的公钥和时钟大小。幸运的是,正如你可能已经注意到的,你不需要每次下载软件时都跑一趟真正的银行。在现实生活中,存储公钥的可信机构被称为认证机构。所有认证机构都维护着可以通过电子方式联系以下载公钥信息的服务器。因此,当你的机器收到数字签名时,它会附带信息,说明哪个认证机构可以担保签名者的公钥。
As has been emphasized throughout, all of our schemes require some sort of trusted “bank” to store the signers' public keys and clock sizes. Fortunately, as you have probably noticed, you don't need to take a trip to a real bank every time you download some software. In real life, the trusted organizations that store public keys are known as certification authorities. All certification authorities maintain servers that can be contacted electronically to download public key information. So when your machine receives a digital signature, it will be accompanied by information stating which certification authority can vouch for the signer's public key.
您可能已经注意到一个问题:当然,您的计算机可以继续使用指定的证书颁发机构验证签名,但我们如何才能信任该机构本身呢?我们所做的只是将验证一个组织(向您发送软件的组织,例如NanoSoft.com)身份的问题转移到验证另一个组织(证书颁发机构,例如 TrustMe Inc.)身份的问题。信不信由你,这个问题通常由证书颁发机构(TrustMe Inc.)将您转交给另一个证书颁发机构(例如 PleaseTrustUs Ltd.)进行验证来解决,验证方式也是通过数字签名。这种信任链可以无限延伸,但我们始终会遇到同一个问题:我们如何才能信任链末端的组织?答案是,如上图所示,某些组织已被正式指定为所谓的根证书颁发机构,简称根 CA。其中较为知名的根 CA 包括 VeriSign、GlobalSign 和 GeoTrust。当您获取浏览器软件时,许多根 CA 的联系详细信息(包括互联网地址和公钥)都会预先安装在您的浏览器软件中,这就是数字签名的信任链如何锚定在可信的起点的方式。
You have probably already noticed a problem here: sure, your computer can go ahead and verify the signature with the designated certification authority, but how can we trust the authority itself? All we have done is transfer the problem of verifying the identity of one organization (the one that sent you the software, say NanoSoft.com), to the problem of verifying the identity of another organization (the certification authority, say, TrustMe Inc.). Believe it or not, this problem is typically solved by the certification authority (TrustMe Inc.) referring you to yet another certification authority (say, PleaseTrustUs Ltd.) for verification, also via a digital signature. This type of chain of trust can be extended indefinitely, but we will always be stuck with the same problem: how can we trust the organization at the end of the chain? The answer, as shown in the figure above, is that certain organizations have been officially designated as so-called root certificate authorities, or root CAs for short. Among the better-known root CAs are VeriSign, GlobalSign, and GeoTrust. The contact details (including internet addresses and public keys) of a number of root CAs come pre-installed in your browser software when you acquire it, and that is how the chain of trust for digital signatures becomes anchored in a trustworthy starting point.
用于获取验证数字签名所需密钥的信任链。
A chain of trust for obtaining keys needed to verify digital signatures.
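The chain-walking logic can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the root name "ExampleRoot CA" is invented, the vouching table is hard-coded rather than fetched from servers, and the digital signature check that would happen at every link is omitted entirely.

```python
# Hypothetical data: who vouches for whom, and which roots came pre-installed.
PREINSTALLED_ROOTS = {"ExampleRoot CA"}
VOUCHED_BY = {
    "NanoSoft.com": "TrustMe Inc.",
    "TrustMe Inc.": "PleaseTrustUs Ltd.",
    "PleaseTrustUs Ltd.": "ExampleRoot CA",
}

def chain_of_trust(signer, vouched_by=VOUCHED_BY, roots=PREINSTALLED_ROOTS):
    """Follow referrals from the signer until a pre-installed root is reached.

    Returns the chain as a list, or None if it breaks or loops.
    (A real system would also verify a signature at each step.)
    """
    chain = [signer]
    while chain[-1] not in roots:
        authority = vouched_by.get(chain[-1])
        if authority is None or authority in chain:
            return None
        chain.append(authority)
    return chain

print(chain_of_trust("NanoSoft.com"))
print(chain_of_trust("UnknownVendor.example"))
```

The design point is that trust is never established by the chain alone; it comes from the final link matching something already stored on your machine.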
悖论已解
A PARADOX RESOLVED
在本章开头,我指出“数字签名”一词本身就可能被视为矛盾:任何数字内容都可以复制,但签名却应该无法复制。这个悖论是如何解决的?答案是,数字签名既依赖于只有签名者知道的秘密,也依赖于被签名的消息。对于特定实体签名的每条消息,秘密(在本章中我们称之为挂锁)保持不变,但每条消息的签名都不同。因此,任何人都可以轻松复制签名这一事实无关紧要:签名无法转移到其他消息,因此仅仅复制它并不构成伪造。
At the start of this chapter, I pointed out that the very phrase “digital signature” could be regarded as an oxymoron: anything digital can be copied, yet a signature should be impossible to copy. How was this paradox resolved? The answer is that a digital signature depends on both a secret known only to the signer and on the message being signed. The secret (which we called a padlock throughout this chapter) stays the same for each message signed by a particular entity, but the signature is different for each message. Thus, the fact that anyone can easily copy the signature is irrelevant: the signature cannot be transferred to a different message, so merely copying it does not create a forgery.
解决这一悖论不仅仅是一个巧妙而美好的想法。数字签名也具有巨大的实际意义:没有它们,我们所知的互联网将不复存在。数据仍然可以使用密码学安全地交换,但验证任何接收数据的来源将变得更加困难。这种深刻的理念与如此广泛的实际影响相结合,意味着数字签名无疑是计算机科学最辉煌的成就之一。
The resolution of this paradox is not just a cunning and beautiful idea. Digital signatures are also of immense practical importance: without them, the internet as we know it would not exist. Data could still be exchanged securely using cryptography, but it would be far more difficult to verify the source of any data received. This combination of a profound idea with such wide practical impact means that digital signatures are, without doubt, one of the most spectacular achievements of computer science.
10
What Is Computable?
—RICHARD FEYNMAN (1965 Nobel Prize in physics)
We've now seen quite a number of clever, powerful, and beautiful algorithms—algorithms that turn the bare metal of a computer into a genius at your fingertips. In fact, it would be natural to wonder, based on the rhapsodic rhetoric in the preceding chapters, if there is anything that computers cannot do for us. The answer is absolutely clear if we limit ourselves to what computers can do today: there are plenty of useful tasks (mostly involving some form of artificial intelligence) that computers can't, at present, perform well. Examples include high-quality translation between languages like English and Chinese, automatically controlling a vehicle to drive safely and quickly in a busy city environment, and (as a teacher, this is a big one for me) grading students' work.
Yet, as we have seen already, it is often surprising what a really clever algorithm can achieve. Perhaps tomorrow, someone will invent an algorithm that will drive a car perfectly or do an excellent job of grading my students' work. These do seem like hard problems—but are they impossibly hard? Indeed, is there any problem at all that is so difficult, no one could ever invent an algorithm to solve it? In this chapter, we will see that the answer is a resounding yes: there are problems that can never be solved by computers. This profound fact—that some things are “computable” and others are not—provides an interesting counterpoint to the many algorithmic triumphs we've seen in the preceding chapters. No matter how many clever algorithms are invented in the future, there will always be problems whose answers are “uncomputable.”
The existence of uncomputable problems is striking enough on its own, but the story of their discovery is even more remarkable. The existence of such problems was known before the first electronic computers were ever built! Two mathematicians, one American and one British, independently discovered uncomputable problems in the late 1930s, several years before the first real computers emerged during the Second World War. The American was Alonzo Church, whose groundbreaking work on the theory of computation remains fundamental to many aspects of computer science. The Briton was none other than Alan Turing, who is commonly regarded as the single most important figure in the founding of computer science. Turing's work spanned the entire spectrum of computational ideas, from intricate mathematical theory and profound philosophy to bold and practical engineering. In this chapter, we will follow in the footsteps of Church and Turing on a journey that will eventually demonstrate the impossibility of using a computer for one particular task. That journey begins in the next section, with a discussion of bugs and crashes.
BUGS, CRASHES, AND THE RELIABILITY OF SOFTWARE
The reliability of computer software has improved tremendously in recent years, but we all know that it's still not a good idea to assume software will work correctly. Very occasionally, even high-quality, well-written software can do something it was not intended to do. In the worst cases, the software will “crash,” and you lose the data or document you were working on (or the video game you were playing—very frustrating, as I know from my own experience). But as anyone who encountered home computers in the 1980s and 90s can testify, computer programs used to crash an awful lot more frequently than they do in the 21st century. There are many reasons for this improvement, but among the chief causes are the great advances in automated software checking tools. In other words, once a team of computer programmers has written a large, complicated computer program, they can use an automatic tool to check their newly created software for problems that might cause it to crash. And these automated checking tools have been getting better and better at finding potential mistakes.
So a natural question to ask would be: will the automated software-checking tools ever get to the point where they can detect all potential problems in all computer programs? This would certainly be nice, since it would eliminate the possibility of software crashes once and for all. The remarkable thing that we'll learn in this chapter is that this software nirvana will never be attained: it is provably impossible for any software-checking tool to detect all possible crashes in all programs.
It's worth commenting a little more on what it means for something to be “provably impossible.” In most sciences, like physics and biology, scientists make hypotheses about the way certain systems behave, and conduct experiments to see if the hypotheses are correct. But because the experiments always have some amount of uncertainty in them, it's not possible to be 100% certain that the hypotheses were correct, even after a very successful experiment. However, in stark contrast to the physical sciences, it is possible to claim 100% certainty about some of the results in mathematics and computer science. As long as you accept the basic axioms of mathematics (such as 1 + 1 = 2), the chain of deductive reasoning used by mathematicians results in absolute certainty that various other statements are true (for example, “any number that ends in a 5 is divisible by 5”). This kind of reasoning does not involve computers: using only a pencil and paper, a mathematician can prove indisputable facts.
So, in computer science, when we say that “X is provably impossible,” we don't just mean that X appears to be very difficult, or might be impossible to achieve in practice. We mean that it is 100% certain that X can never be achieved, because someone has proved it using a chain of deductive, mathematical reasoning. A simple example would be “it is provably impossible that a multiple of 10 ends with the digit 3.” Another example is the final conclusion of this chapter: it is provably impossible for an automated software-checker to detect all possible crashes in all computer programs.
PROVING THAT SOMETHING ISN'T TRUE
Our proof that crash-detecting programs are impossible is going to use a technique that mathematicians call proof by contradiction. Although mathematicians like to lay claim to this technique, it's actually something that people use all the time in everyday life, often without even thinking about it. Let me give you a simple example.
To start with, we need to agree on the following two facts, which would not be disputed by even the most revisionist of historians:
1. The U.S. Civil War took place in the 1860s.
2. Abraham Lincoln was president during the Civil War.
Now, suppose I made the statement: “Abraham Lincoln was born in 1520.” Is this statement true or false? Even if you knew nothing whatsoever about Abraham Lincoln, apart from the two facts above, how could you quickly determine that my statement is false?
Most likely, your brain would go through a chain of reasoning similar to the following: (i) No one lives for more than 150 years, so if Lincoln was born in 1520, he must have died by 1670 at the absolute latest; (ii) Lincoln was president during the Civil War, so the Civil War must have occurred before he died—that is, before 1670; (iii) but that's impossible, because everyone agrees the Civil War took place in the 1860s; (iv) therefore, Lincoln could not possibly have been born in 1520.
But let's try to examine this reasoning more carefully. Why is it valid to conclude that the initial statement was false? It is because we proved that this claim contradicts some other fact that is known to be true. Specifically, we proved that the initial statement implies the Civil War occurred before 1670—which contradicts the known fact that the Civil War took place in the 1860s.
Proof by contradiction is an extremely important technique, so let's do a slightly more mathematical example. Suppose I made the following claim: “On average, a human heart beats about 6000 times in 10 minutes.” Is this claim true or false? You might immediately be suspicious, but how would you go about proving to yourself that it is false? Spend a few seconds now trying to analyze your thought process before reading on.
Again, we can use proof by contradiction. First, assume for argument's sake that the claim is true: human hearts average 6000 beats in 10 minutes. If that were true, how many beats would occur in just one minute? On average, it would be 6000 divided by 10, or 600 beats per minute. Now, you don't have to be a medical expert to know that this is far higher than any normal pulse rate, which is somewhere between 50 and 150 beats per minute. So the original claim contradicts a known fact and must be false: it is not true that human hearts average 6000 beats in 10 minutes.
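The arithmetic in this argument is simple enough to check mechanically. The following is a minimal sketch, not anything from a real library: the function names are made up for illustration, and the 50 to 150 beats-per-minute range is just the rough figure quoted above.

```python
# Rough range for a normal pulse, in beats per minute, as quoted in the text.
NORMAL_PULSE_RANGE = (50, 150)


def implied_beats_per_minute(beats, minutes):
    """Average rate implied by a claim of `beats` heartbeats in `minutes` minutes."""
    return beats / minutes


def contradicts_known_fact(rate):
    """True if the implied rate falls outside the accepted pulse range."""
    low, high = NORMAL_PULSE_RANGE
    return not (low <= rate <= high)


# Assume the claim is true and work out what it implies.
rate = implied_beats_per_minute(6000, 10)

# The implied rate clashes with a known fact, so the claim must be false.
claim_is_false = contradicts_known_fact(rate)
```

Here `rate` plays the role of the implied statement (600 beats per minute), and `claim_is_false` records the contradiction with what we know about pulse rates.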
In more abstract terminology, proof by contradiction can be summarized as follows. Suppose you suspect that some statement S is false, but you would like to prove beyond doubt that it is false. First, you assume that S is true. By applying some reasoning, you work out that some other statement, say T, must also be true. If, however, T is known to be false, you have arrived at a contradiction. This proves that your original assumption (S) must have been false.
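For readers who like to see such patterns written formally, here is one possible rendering in the Lean proof language. It is only a sketch: the theorem name `refute` is an arbitrary choice, while `S` and `T` are the statements from the paragraph above.

```lean
-- If S implies T, and T is known to be false, then S must be false.
theorem refute (S T : Prop) (h : S → T) (hT : ¬T) : ¬S :=
  fun hS => hT (h hS)
```

The one-line proof mirrors the reasoning in words: assume S, derive T from it, and collide with the known fact that T is false.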
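A mathematician would state this much more briefly, by saying something like “S implies T, but T is false, therefore S is false.” That is proof by contradiction in a nutshell. The following table shows how to connect this abstract version of proof by contradiction with the two examples above:
| | First example | Second example |
| S (original statement) | Lincoln was born in 1520 | A human heart beats 6000 times in 10 minutes |
| T (implied by S, but known to be false) | The Civil War occurred before 1670 | A human heart beats 600 times in 1 minute |
| Conclusion: S is false | Lincoln was not born in 1520 | A human heart does not beat 6000 times in 10 minutes |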
For now, our detour into proof by contradiction is finished. The final goal of this chapter will be to prove, by contradiction, that a program which detects all possible crashes in other programs cannot exist. But before marching on toward this final goal, we need to gain familiarity with some interesting concepts about computer programs.
PROGRAMS THAT ANALYZE OTHER PROGRAMS
Computers slavishly follow the exact instructions in their computer programs. They do this completely deterministically, so the output of a computer program is exactly the same every time you run it. Right? Or wrong? In fact, I haven't given you enough information to answer this question. It's true that certain simple computer programs produce exactly the same output every time they are run, but most of the programs we use every day look very different every time we run them. Consider your favorite word processing program: does the screen look the same every time it starts up? Of course not—it depends on what document you opened. If I use Microsoft Word to open the file “address-list.docx,” the screen will display a list of addresses that I keep on my computer. If I use Microsoft Word to open the file “bank-letter.docx,” I see the text of a letter I wrote to my bank yesterday. (If the “.docx” here seems mysterious to you, check out the box on the facing page to find out about file name extensions.)
Let's be very clear about one thing: in both cases, I'm running exactly the same computer program, which is Microsoft Word. It's just that the inputs are different in each case. Don't be fooled by the fact that all modern operating systems let you run a computer program by double-clicking on a document. That is just a convenience that your friendly computer company (most likely Apple or Microsoft) has provided you. When you double-click on a document, a certain computer program gets run, and that program uses the document as its input. The output of the program is what you see on the screen, and naturally it depends on what document you clicked on.
Throughout this chapter, I'll be using file names like “abcd.txt.” The part after the period is called the “extension” of the file name—in this case, the extension of “abcd.txt” is “txt.” Most operating systems use the extension of a file name to decide what type of data the file contains. For example, a “.txt” file typically contains plain text, a “.html” file typically contains a web page, and a “.docx” file contains a Microsoft Word document. Some operating systems hide these extensions by default, so you might not see them unless you turn off the “hide extensions” feature in your operating system. A quick web search for “unhide file extensions” will turn up instructions on how to do this.
Some technical details about file name extensions.
In reality, the input and output of computer programs is quite a bit more complex than this. For instance, when you click on menus or type into a program, you are giving it additional input. And when you save a document or any other file, the program is creating additional output. But to keep things simple, let's imagine that programs accept exactly one input, which is a file stored on your computer. And we'll also imagine that programs produce exactly one output, which is a graphical window on your monitor.
Unfortunately, the modern convenience of double-clicking on files clouds an important issue here. Your operating system uses various clever tricks to guess which program you would like to run whenever you double-click on a file. But it's very important to realize that it's possible to open any file using any program. Or to put it another way, you can run any program using any file as its input. How can you do this? The box on the next page lists several methods you can try. These methods will not work on all operating systems, or on all choices of input file—different operating systems launch programs in different ways, and they sometimes limit the choice of input file due to security concerns. Nevertheless, I strongly urge you to experiment for a few minutes with your own computer, to convince yourself that you can run your favorite word processing program with various different types of input files.
Here are three ways you could run the program Microsoft Word using stuff.txt as the input file:
• Right-click on stuff.txt, choose “Open with…,” and select Microsoft Word.
• First, use the features of your operating system to place a shortcut to Microsoft Word on your desktop. Then drag stuff.txt onto this Microsoft Word shortcut.
• Open the Microsoft Word application directly, go to the “File” menu, choose the “Open” command, make sure the option to display “all files” is selected, then choose stuff.txt.
Various ways of running a program with a particular file as its input.
Microsoft Excel run with “photo.jpg” as its input. The output is garbage, but the important point is that you can, in principle, run any program on any input you want.
Obviously, you can get rather unexpected results if you open a file using a program it was not intended for. In the figure above, you can see what happens if I open the picture file “photo.jpg” with my spreadsheet program, Microsoft Excel. In this case, the output is garbage and is no use to anyone. But the spreadsheet program did run, and did produce some output.
This may already seem ridiculous, but we can take the craziness one step further. Remember that computer programs are themselves stored on the computer's disk as files. Often, these programs have a name that ends in “.exe,” which is short for “executable”—this just means that you can “execute,” or run, the program. So because computer programs are just files on the disk, we can feed one computer program as input to another computer program. As one specific example, the Microsoft Word program is stored on my computer as the file “WINWORD.EXE.” So by running my spreadsheet program with the file WINWORD.EXE as input, I can produce the wonderful garbage you see in the figure on the facing page.
Microsoft Excel examines Microsoft Word. When Excel opens the file WINWORD.EXE, the result is, unsurprisingly, garbage.
Again, it would be well worth trying this experiment for yourself. To do that, you will need to locate the file WINWORD.EXE. On my computer, WINWORD.EXE lives in the folder “C:\Program Files\Microsoft Office\Office12,” but the exact location depends on what operating system you are running and what version of Microsoft Office is installed. You may also need to enable the viewing of “hidden files” before you can see this folder. And, by the way, you can do this experiment (and one below) with any spreadsheet and word processing programs, so you don't need Microsoft Office to try it.
One final level of stupidity is possible here. What if we ran a computer program on itself? For example, what if I ran Microsoft Word, using the file WINWORD.EXE as input? Well, it's easy enough to try this experiment. The figure on the next page shows the result when I try it on my computer. As with the previous few examples, the program runs just fine, but the output on the screen is mostly garbage. (Once again, try it for yourself.)
So, what is the point of all this? The purpose of this section was to acquaint you with some of the more obscure things you can do when running a program. By now, you should be comfortable with three slightly strange ideas that will be very important later. First, there is the notion that any program can be run with any file as input, but the resulting output will usually be garbage unless the input file was intentionally produced to work with the program you chose to run. Second, we found out that computer programs are stored as files on computer disks, and therefore one program can be run with another program as its input file. Third, we realized that a computer program can be run using its own file as the input. So far, the second and third activities always produced garbage, but in the next section we will see a fascinating instance in which these tricks finally bear some fruit.
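If you would rather see the first of these ideas in code than in a word processor, the following is a minimal sketch in Python. The function name `open_with_anything` is invented for this illustration. It plays the role of a very crude program that accepts any file at all as input and shows the first few bytes as text, whether or not they make sense.

```python
def open_with_anything(input_path, max_bytes=32):
    """Read the start of any file and render it as text, garbage or not."""
    with open(input_path, "rb") as f:
        raw = f.read(max_bytes)
    # Bytes that are not valid text are replaced by a placeholder character,
    # which gives the same kind of garbage seen in the figures.
    return raw.decode("utf-8", errors="replace")
```

Given a plain text file, this returns readable text; given a picture or a program file, it usually returns garbage. Calling `open_with_anything(sys.executable)` (after `import sys`) feeds the Python interpreter's own program file in as input, and calling `open_with_anything(__file__)` from inside a saved script runs the script on itself.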
Microsoft Word examines itself. The open document is the file WINWORD.EXE, which is the actual computer program run when you click on Microsoft Word.
SOME PROGRAMS CAN'T EXIST
Computers are great at executing simple instructions—in fact, modern computers execute simple instructions billions of times every second. So you might think that any task that could be described in simple, precise English could be written down as a computer program and executed by a computer. My objective in this section is to convince you that the opposite is true: there are some simple, precise English statements that are literally impossible to write down as a computer program.
Some Simple Yes-No Programs
To keep things as simple as possible in this section, we will consider only a very boring set of computer programs. We'll call these “yes-no” programs, because the only thing these programs can do is pop up a single dialog box, and the dialog box can contain either the word “yes” or the word “no.” For example, a few minutes ago I wrote a computer program called ProgramA.exe, which does nothing but produce the following dialog box:
Note that by looking in the title bar of the dialog box, you can see the name of the program that produced this output—in this case, ProgramA.exe.
I also wrote a different computer program called ProgramB.exe, which outputs “no” instead of “yes”:
ProgramA and ProgramB are extremely simple—so simple, in fact, that they do not require any input (if they do receive input, they ignore it). In other words, they are examples of programs that really do behave exactly the same every time they are run, regardless of any input they may be given.
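To make this concrete, here is a sketch of how these two programs might look as Python functions. This is an illustration only, not the actual code behind ProgramA.exe and ProgramB.exe: the function names are assumptions, and a returned string stands in for the dialog box.

```python
def program_a(input_path=None):
    """Stand-in for ProgramA.exe: ignores its input and always outputs "yes"."""
    return "yes"


def program_b(input_path=None):
    """Stand-in for ProgramB.exe: ignores its input and always outputs "no"."""
    return "no"
```

Whatever file name is passed in, `program_a` answers “yes” and `program_b` answers “no.”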
As a more interesting example of one of these yes-no programs, I created a program called SizeChecker.exe. This program takes one file as input and outputs “yes” if that file is bigger than 10 kilobytes and otherwise outputs “no.” If I right-click on a 50-megabyte video file (say, mymovie.mpg), choose “Open with…,” and select SizeChecker.exe, I will see the following output:
On the other hand, if I run the same program on a small 3-kilobyte e-mail message (say, myemail.msg), I will, of course, see a different output:
Therefore, SizeChecker.exe is an example of a yes-no program that sometimes outputs “yes” and sometimes “no.”
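A program like this takes only a few lines. The sketch below shows one way it might be written in Python. It makes two assumptions that are not taken from the real SizeChecker.exe: that a kilobyte means 1024 bytes, and that returning a string is an acceptable stand-in for popping up a dialog box.

```python
import os

# Assumption: one kilobyte is taken to be 1024 bytes.
TEN_KILOBYTES = 10 * 1024


def size_checker(input_path):
    """Output "yes" if the input file is bigger than 10 kilobytes, else "no"."""
    return "yes" if os.path.getsize(input_path) > TEN_KILOBYTES else "no"
```

Unlike ProgramA and ProgramB, this program actually looks at its input, so its output changes from one input file to another.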
Now consider the following slightly different program, which we'll call NameSize.exe. This program examines the name of its input file. If the file name is at least one character long, NameSize.exe outputs “yes”; otherwise, it outputs “no.” What are the possible outputs of this program? Well, by definition, the name of any input file is at least one character long (otherwise, the file would have no name at all, and you couldn't select it in the first place). Therefore, NameSize.exe will always output “yes,” regardless of its input.
By the way, the last few programs mentioned above are our first examples of programs that do not produce garbage when they are given other programs as input. For example, it turns out that the size of the file NameSize.exe is only about 8 kilobytes. So if I run SizeChecker.exe with NameSize.exe as the input, the output is “no” (because NameSize.exe is not more than 10 kilobytes). We can even run SizeChecker.exe on itself. The output this time is “yes,” because it turns out that SizeChecker.exe is larger than 10 kilobytes (about 12 kilobytes, in fact). Similarly, we could run NameSize.exe with itself as input; the output would be “yes” since the file name “NameSize.exe” contains at least one character. All of the yes-no programs we have discussed thus far are admittedly rather dull, but it's important to understand their behavior, so work through the table of outputs for these programs line by line, making sure you agree with each output.
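The runs just described can be reproduced with a short sketch. Everything here is a stand-in: the two functions imitate SizeChecker.exe and NameSize.exe, a kilobyte is assumed to be 1024 bytes, and the “program files” are dummy files padded to the sizes quoted above (8 and 12 kilobytes), since only their names and sizes matter to these two programs.

```python
import os
import tempfile

# Assumption: one kilobyte is taken to be 1024 bytes.
TEN_KILOBYTES = 10 * 1024


def size_checker(input_path):
    """Output "yes" if the input file is bigger than 10 kilobytes, else "no"."""
    return "yes" if os.path.getsize(input_path) > TEN_KILOBYTES else "no"


def name_size(input_path):
    """Output "yes" if the input file's name has at least one character."""
    return "yes" if len(os.path.basename(input_path)) >= 1 else "no"


def make_stand_in(folder, name, size_in_bytes):
    """Create a dummy file of the given size to stand in for a program file."""
    path = os.path.join(folder, name)
    with open(path, "wb") as f:
        f.write(b"\0" * size_in_bytes)
    return path


def run_examples():
    """Return the outputs of the program-on-program runs described above."""
    with tempfile.TemporaryDirectory() as folder:
        name_size_exe = make_stand_in(folder, "NameSize.exe", 8 * 1024)
        size_checker_exe = make_stand_in(folder, "SizeChecker.exe", 12 * 1024)
        return {
            "SizeChecker.exe on NameSize.exe": size_checker(name_size_exe),
            "SizeChecker.exe on SizeChecker.exe": size_checker(size_checker_exe),
            "NameSize.exe on NameSize.exe": name_size(name_size_exe),
        }
```

Calling `run_examples()` gives “no” for SizeChecker.exe on NameSize.exe and “yes” for the two runs of a program on itself, in agreement with the discussion above.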
AlwaysYes.exe: A Yes-No Program That Analyzes Other Programs
We're now in a position to think about some much more interesting yes-no programs. The first one we'll investigate is called “AlwaysYes.exe.” This program examines the input file it is given and outputs “yes” if the input file is itself a yes-no program that always outputs “yes.” Otherwise, the output of AlwaysYes.exe is “no.” Note that AlwaysYes.exe works perfectly well on any kind of input file. If you give it an input that isn't an executable program (e.g., address-list.docx), it will output “no.” If you give it an input that is an executable program, but isn't a yes-no program (e.g., WINWORD.EXE), it will output “no.” If you give it an input that is a yes-no program, but it's a program that sometimes outputs “no,” then AlwaysYes.exe outputs “no.” The only way that AlwaysYes.exe can output “yes” is if you input a yes-no program that always outputs “yes,” regardless of its input. In our discussions so far, we've seen two examples of programs like this: ProgramA.exe, and NameSize.exe. The table on the next page summarizes the output of AlwaysYes.exe on various different input files, including the possibility of running AlwaysYes.exe on itself. As you can see in the last line of the table, AlwaysYes.exe outputs “no” when run on itself, because there are at least some input files on which it outputs “no.”
| Program run | Input file | Output |
| ProgramA.exe | address-list.docx | yes |
| ProgramA.exe | ProgramA.exe | yes |
| ProgramB.exe | address-list.docx | no |
| ProgramB.exe | ProgramA.exe | no |
| SizeChecker.exe | mymovie.mpg (50MB) | yes |
| SizeChecker.exe | myemail.msg (3KB) | no |
| SizeChecker.exe | NameSize.exe (8KB) | no |
| SizeChecker.exe | SizeChecker.exe (12KB) | yes |
| NameSize.exe | mymovie.mpg | yes |
| NameSize.exe | ProgramA.exe | yes |
| NameSize.exe | NameSize.exe | yes |
The outputs of some simple yes-no programs. Note the distinction between programs that always output “yes,” regardless of their input (e.g., ProgramA.exe, NameSize.exe), and programs that output “no” either sometimes (e.g., SizeChecker.exe) or always (e.g., ProgramB.exe).
In the next-to-last line of this table, you may have noticed the appearance of a program called Freeze.exe, which has not been described yet. Freeze.exe is a program that does one of the most annoying things a computer program can do: it “freezes” (no matter what its input is). You have probably experienced this yourself, when a video game or an application program seems to just lock up (or “freeze”) and refuses to respond to any more input whatsoever. After that, your only option is to kill the program. If that doesn't work, you might even need to turn off the power (sometimes, when using a laptop, this requires removing the batteries!) and reboot. Computer programs can freeze for a variety of different reasons. Sometimes, it is due to “deadlock,” which was discussed in chapter 8. In other cases, the program might be busy performing a calculation that will never end—for example, repeatedly searching for a piece of data that is not actually present.
AlwaysYes.exe outputs
| input file | output |
| address-list.docx | no |
| mymovie.mpg | no |
| WINWORD.EXE | no |
| ProgramA.exe | yes |
| ProgramB.exe | no |
| NameSize.exe | yes |
| SizeChecker.exe | no |
| Freeze.exe | no |
| AlwaysYes.exe | no |
The outputs of AlwaysYes.exe for various inputs. The only inputs that produce a “yes” are yes-no programs that always output “yes”—in this case, ProgramA.exe and NameSize.exe.
In any case, we don't need to understand the details about programs that freeze. We just need to know what AlwaysYes.exe should do when it's given such a program as input. In fact, AlwaysYes.exe was defined carefully so that the answer is clear: AlwaysYes.exe outputs “yes” if its input always outputs “yes”; otherwise, it outputs “no.” Therefore, when the input is a program like Freeze.exe, AlwaysYes.exe must output “no,” and this is what we see in the next-to-last line of the table above.
YesOnSelf.exe: A Simpler Variant of AlwaysYes.exe
It may have already occurred to you that AlwaysYes.exe is a rather clever and useful program, since it can analyze other programs and predict their outputs. I will admit that I didn't actually write this program—I just described how it would behave, if I had written it. And now I am going to describe another program, called YesOnSelf.exe. This program is similar to AlwaysYes.exe, but simpler. Instead of outputting “yes” if the input file always outputs “yes,” YesOnSelf.exe outputs “yes” if the input file outputs “yes” when run on itself; otherwise, YesOnSelf.exe outputs “no.” In other words, if I provide SizeChecker.exe as the input to YesOnSelf.exe, then YesOnSelf.exe will do some kind of analysis on SizeChecker.exe to determine what the output is when SizeChecker.exe is run with SizeChecker.exe as the input. As we already discovered (see the table on page 185), the output of SizeChecker.exe on itself is “yes.” Therefore, the output of YesOnSelf.exe on SizeChecker.exe is “yes” too. You can use the same kind of reasoning to fill in the outputs of YesOnSelf.exe for various other inputs. Note that if the input file isn't a yes-no program, then YesOnSelf.exe automatically outputs “no.” The table above shows some of the outputs for YesOnSelf.exe—try to verify that you understand each line of this table, since it's very important to understand the behavior of YesOnSelf.exe before reading on.
YesOnSelf.exe outputs
| input file | output |
| address-list.docx | no |
| mymovie.mpg | no |
| WINWORD.EXE | no |
| ProgramA.exe | yes |
| ProgramB.exe | no |
| NameSize.exe | yes |
| SizeChecker.exe | yes |
| Freeze.exe | no |
| AlwaysYes.exe | no |
| YesOnSelf.exe | ??? |
The outputs of YesOnSelf.exe for various inputs. The only inputs that produce a “yes” are yes-no programs that output “yes” when given themselves as input—in this case, ProgramA.exe, NameSize.exe, and SizeChecker.exe. The last line in the table is something of a mystery, since it seems as though either possible output might be correct. The text discusses this in more detail.
We need to note two more things about this rather interesting program, YesOnSelf.exe. First, take a look at the last line in the table above. What should be the output of YesOnSelf.exe, when it is given the file YesOnSelf.exe as an input? Luckily, there are only two possibilities, so we can consider each one in turn. If the output is “yes,” we know that (according to the definition of YesOnSelf.exe), YesOnSelf.exe should output “yes” when run on itself. This is a bit of a tongue twister, but if you reason through it carefully, you'll see that everything is perfectly consistent, so you might be tempted to conclude that “yes” is the right answer.
But let's not be too hasty. What if the output of YesOnSelf.exe when run on itself happened to be “no”? Well, it would mean that (again, according to the definition of YesOnSelf.exe) YesOnSelf.exe should output “no” when run on itself. Again, this statement is perfectly consistent! It seems like YesOnSelf.exe can actually choose what its output should be. As long as it sticks to its choice, its answer will be correct. This mysterious freedom in the behavior of YesOnSelf.exe will soon turn out to be the innocuous tip of a rather treacherous iceberg, but for now we will not explore this issue further.
The second thing to note about YesOnSelf.exe is that, as with the slightly more complicated AlwaysYes.exe, I didn't actually write the program. All I did was describe its behavior. However, note that if we assume I did write AlwaysYes.exe, then it would be easy to create YesOnSelf.exe. Why? Because YesOnSelf.exe is simpler than AlwaysYes.exe: it only has to examine one possible input, rather than all possible inputs.
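One way to picture that construction is sketched below. It is an illustration under assumptions, not the author's code: programs are modeled as Python functions of one argument, `always_yes` is passed in as a parameter because no real implementation can exist, and `toy_always_yes` is a deliberately naive stand-in that merely runs its input on three sample values. The trick is to wrap the input in a new program that ignores its own input and runs the original on itself, so that "always outputs yes" collapses to "outputs yes on itself".

```python
from typing import Callable

Program = Callable[[object], str]  # a yes-no program: one input, "yes" or "no" out


def make_yes_on_self(always_yes: Callable[[Program], str]) -> Program:
    """Build YesOnSelf from a (hypothetical) AlwaysYes analyzer."""
    def yes_on_self(program: object) -> str:
        if not callable(program):
            return "no"  # the input is not a program at all

        def run_on_self(_ignored: object) -> str:
            # Ignores its own input and runs the program on itself.
            return program(program)

        return always_yes(run_on_self)

    return yes_on_self


def make_anti_yes_on_self(yes_on_self: Program) -> Program:
    """Reverse every output of YesOnSelf."""
    def anti_yes_on_self(program: object) -> str:
        return "no" if yes_on_self(program) == "yes" else "yes"

    return anti_yes_on_self


def toy_always_yes(program: Program) -> str:
    # Naive stand-in (an assumption for demonstration only): it just runs the
    # program on a few samples, so it would hang on a program that freezes.
    samples = ["address-list.docx", 42, None]
    return "yes" if all(program(s) == "yes" for s in samples) else "no"


program_a = lambda _input: "yes"
program_b = lambda _input: "no"

yes_on_self = make_yes_on_self(toy_always_yes)
anti_yes_on_self = make_anti_yes_on_self(yes_on_self)
print(yes_on_self(program_a), yes_on_self(program_b))
```

Because the stand-in works by running its input, it should not be applied to `yes_on_self` or `anti_yes_on_self` themselves; a genuine analyzer would have to answer without getting stuck.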
AntiYesOnSelf.exe: The Opposite of YesOnSelf.exe
It's time to take a breath and remember where we are trying to get to. The objective of this chapter is to prove that a crash-finding program cannot exist. But our immediate objective is less lofty. In this section, we are merely trying to find an example of some program that cannot exist. This will be a useful steppingstone on the way to our ultimate goal, because once we've seen how to prove that a certain program can't exist, it will be reasonably straightforward to use the same technique on a crash-finding program. The good news is, we are very close to this steppingstone goal. We will investigate one more yes-no program, and the job will be done.
The new program is called “AntiYesOnSelf.exe.” As its name suggests, it is very similar to YesOnSelf.exe; in fact, it is identical, except that its outputs are reversed. So if YesOnSelf.exe would output “yes” given a certain input, then AntiYesOnSelf.exe would output “no” on that same input. And if YesOnSelf.exe outputs “no” on an input, AntiYesOnSelf.exe outputs “yes” on that input.
Whenever the input file is a yes-no program, AntiYesOnSelf.exe answers the question:
Will the input program, when run on itself, output “no”?
A concise description of the behavior of AntiYesOnSelf.exe.
Although that amounts to a complete and precise definition of AntiYesOnSelf.exe's behavior, it will help to spell out the behavior even more explicitly. Recall that YesOnSelf.exe outputs “yes” if its input would output “yes” when run on itself, and “no” otherwise. Therefore, AntiYesOnSelf.exe outputs “no” if its input would output “yes” when run on itself, and “yes” otherwise. Or to put it another way, AntiYesOnSelf.exe answers the following question about its input: “Is it true that the input file, when run on itself, will not output ‘yes’?”
Admittedly, this description of AntiYesOnSelf.exe is another tongue twister. You might think it would be simpler to rephrase it as “Will the input file, when run on itself, output ‘no’?” Why would that be incorrect? Why do we need the legalese about not outputting “yes,” instead of the simpler statement about outputting “no”? The answer is that programs can sometimes do something other than output “yes” or “no.” So if someone tells us that a certain program does not output “yes,” we can't automatically conclude that it outputs “no.” For example, it might output garbage, or even freeze. However, there is one particular situation in which we can draw a stronger conclusion: if we are told in advance that a program is a yes-no program, then we know that the program never freezes and never produces garbage: it always terminates and produces the output “yes” or “no.” Therefore, for yes-no programs, the legalese about not outputting “yes” is equivalent to the simpler statement about outputting “no.”
Finally, therefore, we can give a very simple description of AntiYesOnSelf.exe's behavior. Whenever the input file is a yes-no program, AntiYesOnSelf.exe answers the question “Will the input program, when run on itself, output ‘no’?” This formulation of AntiYesOnSelf.exe's behavior will be so important later that I have put it in a box above.
Given the work we've done already to analyze YesOnSelf.exe, it is particularly easy to draw up a table of outputs for AntiYesOnSelf.exe. In fact, we can just copy the table on page 187, switching all the outputs from “yes” to “no” and vice versa. Doing this produces the table above. As usual, it would be a good idea to run through each line in this table, and verify that you agree with the entries in the output column. Whenever the input file is a yes-no program, you can use the simple formulation in the box on the previous page, instead of working through the more complicated one given earlier.
AntiYesOnSelf.exe outputs
| input file | output |
| address-list.docx | yes |
| mymovie.mpg | yes |
| WINWORD.EXE | yes |
| ProgramA.exe | no |
| ProgramB.exe | yes |
| NameSize.exe | no |
| SizeChecker.exe | no |
| Freeze.exe | yes |
| AlwaysYes.exe | yes |
| AntiYesOnSelf.exe | ??? |
The outputs of AntiYesOnSelf.exe for various inputs. By definition, AntiYesOnSelf.exe produces the opposite answer to YesOnSelf.exe, so this table—except for its last row—is identical to the one on page 187, but with the outputs switched from “yes” to “no” and vice versa. The last row presents a grave difficulty, as discussed in the text.
As you can see from the last row of the table, a problem arises when we try to compute the output of AntiYesOnSelf.exe on itself. To help us analyze this, let's further simplify the description of AntiYesOnSelf.exe given in the box on the previous page: instead of considering all possible yes-no programs as inputs, we'll concentrate on what happens when AntiYesOnSelf.exe is given itself as input. So the question in bold in that box, “Will the input program, …,” can be rephrased as “Will AntiYesOnSelf.exe, …,” because the input program is AntiYesOnSelf.exe. This is the final formulation we will need, so it is also presented in a box on the facing page.
Now we're ready to work out the output of AntiYesOnSelf.exe on itself. There are only two possibilities (“yes” and “no”), so it shouldn't be too hard to work through this. We'll just deal with each of the cases in turn:
AntiYesOnSelf.exe, when given itself as input, answers the question:
Will AntiYesOnSelf.exe, when run on itself, output “no”?
A concise description of the behavior of AntiYesOnSelf.exe when given itself as input. Note that this box is just a simplified version of the box on page 189, specialized to the single case that the input file is AntiYesOnSelf.exe.
Case 1 (output is “yes”): If the output is “yes,” then the answer to the question in bold in the box above is “no.” But the answer to the bold question is, by definition, the output of AntiYesOnSelf.exe (read the whole box again to convince yourself of this)—and therefore, the output must be “no.” To summarize, we just proved that if the output is “yes,” then the output is “no.” Impossible! In fact, we have arrived at a contradiction. (If you're not familiar with the technique of proof by contradiction, this would be a good time to go back and review the discussion of this topic earlier in this chapter. We'll be using the technique repeatedly in the next few pages.) Because we obtained a contradiction, our assumption that the output is “yes” must be invalid. We have proved that the output of AntiYesOnSelf.exe, when run on itself, cannot be “yes.” So let's move on to the other possibility.
Case 2 (output is “no”): If the output is “no,” then the answer to the question in bold in the box above is “yes.” But, just as in case 1, the answer to the bold question is, by definition, the output of AntiYesOnSelf.exe—and, therefore, the output must be “yes.” In other words, we just proved that if the output is “no,” then the output is “yes.” Once again, we have obtained a contradiction, so our assumption that the output is “no” must be invalid. We have proved that the output of AntiYesOnSelf.exe, when run on itself, cannot be “no.”
So what now? We have eliminated the only two possibilities for the output of AntiYesOnSelf.exe when run on itself. This too is a contradiction: AntiYesOnSelf.exe was defined to be a yes-no program—a program that always terminates and produces one of the two outputs “yes” or “no.” And yet we just demonstrated a particular input for which AntiYesOnSelf.exe does not produce either of these outputs! This contradiction implies that our initial assumption was false: thus, it is not possible to write a yes-no program that behaves like AntiYesOnSelf.exe.
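The two-case analysis can also be checked mechanically. The sketch below does not run any of the impossible programs; it only encodes their definitions as rules (the function names are mine) and asks which assumed outputs agree with what the definition then demands. For contrast it includes YesOnSelf.exe, whose output on itself was left free.

```python
OUTPUTS = ("yes", "no")


def yes_on_self_demands(assumed: str) -> str:
    # YesOnSelf outputs "yes" exactly when its input, run on itself, outputs "yes".
    # Here the input is YesOnSelf itself, assumed to output `assumed`.
    return "yes" if assumed == "yes" else "no"


def anti_yes_on_self_demands(assumed: str) -> str:
    # AntiYesOnSelf outputs "yes" exactly when its input, run on itself, outputs "no".
    # Here the input is AntiYesOnSelf itself, assumed to output `assumed`.
    return "yes" if assumed == "no" else "no"


def consistent_outputs(demands) -> list:
    """Return every assumed output that matches the output the definition demands."""
    return [out for out in OUTPUTS if demands(out) == out]


print(consistent_outputs(yes_on_self_demands))       # ['yes', 'no']
print(consistent_outputs(anti_yes_on_self_demands))  # []
```

Both outputs are consistent for YesOnSelf.exe, while no output at all is consistent for AntiYesOnSelf.exe, which is exactly the contradiction reached in cases 1 and 2.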
Now you will see why I was very careful to be honest and admit that I did not actually write any of the programs AlwaysYes.exe, YesOnSelf.exe, or AntiYesOnSelf.exe. All I did was describe how these programs would behave if I did write them. In the last paragraph, we used proof by contradiction to show that AntiYesOnSelf.exe cannot exist. But we can prove even more: the existence of AlwaysYes.exe and YesOnSelf.exe is also impossible! Why is this? As you can probably guess, proof by contradiction is again the key tool. Recall how we discussed, on page 188, that if AlwaysYes.exe existed, it would be easy to make a few small changes to it and produce YesOnSelf.exe. And if YesOnSelf.exe existed, it would be extremely easy to produce AntiYesOnSelf.exe, since we just have to reverse the outputs (“yes” instead of “no,” and vice versa). In summary, if AlwaysYes.exe exists, then so does AntiYesOnSelf.exe. But we already know that AntiYesOnSelf.exe can't exist, and, therefore, AlwaysYes.exe can't exist either. The same argument shows that YesOnSelf.exe is also an impossibility.
Remember, this whole section was just a steppingstone toward our final goal of proving that crash-finding programs are impossible. The more modest goal in this section was to give some examples of programs that cannot exist. We've achieved this by examining three different programs, each of which is impossible. Of these three, the most interesting is AlwaysYes.exe. The other two are rather obscure, in that they concentrate on the behavior of programs that are given themselves as input. AlwaysYes.exe, on the other hand, is a very powerful program, since if it existed, it could analyze any other program and tell us whether that program always outputs “yes.” But as we've now seen, no one will ever be able to write such a clever and useful-sounding program.
THE IMPOSSIBILITY OF FINDING CRASHES
We are finally ready to begin a proof about a program that successfully analyzes other programs and determines whether or not they crash: specifically, we will be proving that such a program cannot exist. After reading the last few pages, you have probably guessed that we will be using proof by contradiction. That is, we will start off by assuming that our holy grail exists: there is some program called CanCrash.exe which can analyze other programs and tell us whether or not they can crash. After doing some strange, mysterious, and delightful things to CanCrash.exe, we will arrive at a contradiction.
The result of a crash on one particular operating system. Different operating systems handle crashes in different ways, but we all know one when we see one. This TroubleMaker.exe program was deliberately written to cause a crash, demonstrating that intentional crashes are easy to achieve.
One of the steps in the proof requires us to take a perfectly good program and alter it so that it deliberately crashes under certain circumstances. How can we do such a thing? It is, in fact, very easy. Program crashes can arise from many different causes. One of the more common is when the program tries to divide by zero. In mathematics, the result of taking any number and dividing it by zero is called “undefined.” In a computer, “undefined” is a serious error and the program cannot continue, so it crashes. Therefore, one simple way to make a program crash deliberately is to insert a couple of extra instructions into the program that will divide a number by zero. In fact, that is exactly how I produced the TroubleMaker.exe example in the figure above.
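As a rough illustration of how little it takes, here is a Python analogue of such a program. It is not the author's TroubleMaker.exe: in Python a division by zero raises a `ZeroDivisionError`, and if nothing catches it the program stops with an error report, which plays the role of the crash dialog.

```python
def trouble_maker(_input_file=None):
    """Crash on purpose, whatever the input is."""
    numerator = 1
    denominator = 0
    # The next line raises ZeroDivisionError; uncaught, it ends the program.
    return numerator / denominator


if __name__ == "__main__":
    try:
        trouble_maker("address-list.docx")
    except ZeroDivisionError as error:
        # Caught here only so that the demonstration can report what happened.
        print("crashed:", error)
```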
Now we begin the main proof of the impossibility of a crash-finding program. The figure on the following page summarizes the flow of the argument. We start off assuming the existence of CanCrash.exe, which is a yes-no program that always terminates, outputting “yes” if the program it receives as input can ever crash under any circumstances, and outputting “no” if the input program never crashes.
Now we make a somewhat weird change to CanCrash.exe: instead of outputting “yes,” we will make it crash instead! (As discussed above, it's easy to do this by deliberately dividing by zero.) Let's call the resulting program CanCrashWeird.exe. So this program deliberately crashes—causing the appearance of a dialog box similar to the one above—if its input can crash, and outputs “no” if its input never crashes.
The next step shown in the figure is to transform CanCrashWeird.exe into a more obscure beast called CrashOnSelf.exe. This program, just like YesOnSelf.exe in the last section, is concerned only with how programs behave when given themselves as input. Specifically, CrashOnSelf.exe examines the input program it is given and deliberately crashes if that program would crash when run on itself. Otherwise, it outputs “no.” Note that it's easy to produce CrashOnSelf.exe from CanCrashWeird.exe: the procedure is exactly the same as the one for transforming AlwaysYes.exe into YesOnSelf.exe, which we discussed on page 188.
A sequence of four crash-detecting programs that cannot exist. The last program, AntiCrashOnSelf.exe, is obviously impossible, since it produces a contradiction when run on itself. However, each of the programs can be produced easily by a small change to the one above it (shown by the arrows). Therefore, none of the four programs can exist.
The final step in the sequence of the four programs in the figure is to transform CrashOnSelf.exe into AntiCrashOnSelf.exe. This simple step just reverses the behavior of the program: so if its input crashes when run on itself, AntiCrashOnSelf.exe outputs “yes.” But if the input doesn't crash when run on itself, AntiCrashOnSelf.exe deliberately crashes.
Now we've arrived at a point where we can produce a contradiction. What will AntiCrashOnSelf.exe do when given itself as input? According to its own description, it should output “yes” if it crashes (a contradiction, since it can't terminate successfully with the output “yes” if it has already crashed). And again according to its own description, AntiCrashOnSelf.exe should crash if it doesn't crash, which is also self-contradictory. We've eliminated both possible behaviors of AntiCrashOnSelf.exe, which means the program could not have existed in the first place.
Finally, we can use the chain of transformations shown in the figure on the facing page to prove that CanCrash.exe can't exist either. If it did exist, we could transform it, by following the arrows in the figure, into AntiCrashOnSelf.exe, but we already know AntiCrashOnSelf.exe can't exist. That's a contradiction, and, therefore, our assumption that CanCrash.exe exists must be false.
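The chain of four programs can be written down as a sketch, which makes it plain that each step is a small edit to the previous one. Everything here is illustrative: programs are modeled as Python functions of one argument, a crash is modeled as a `ZeroDivisionError`, `can_crash` is a parameter because the real thing cannot exist, and `toy_can_crash` is a naive stand-in that only tries a single sample input.

```python
def build_chain(can_crash):
    """Derive the three later programs from a (hypothetical) CanCrash analyzer."""

    def can_crash_weird(program):
        # Same as CanCrash, but crash instead of outputting "yes".
        if can_crash(program) == "yes":
            return 1 / 0  # deliberate crash
        return "no"

    def crash_on_self(program):
        # Ask only about the program run on itself.
        return can_crash_weird(lambda _ignored: program(program))

    def anti_crash_on_self(program):
        # Reverse CrashOnSelf: output "yes" where it crashed, crash where it said "no".
        try:
            crash_on_self(program)
        except ZeroDivisionError:
            return "yes"
        return 1 / 0  # deliberate crash

    return can_crash_weird, crash_on_self, anti_crash_on_self


def toy_can_crash(program):
    # Naive stand-in (assumption): runs the program once and watches for a crash.
    # A real CanCrash would have to consider every input without running anything.
    try:
        program("sample input")
    except Exception:
        return "yes"
    return "no"


crasher = lambda _input: 1 / 0
safe = lambda _input: "no"

can_crash_weird, crash_on_self, anti_crash_on_self = build_chain(toy_can_crash)
print(crash_on_self(safe), anti_crash_on_self(crasher))
```

The stand-in must not be handed `anti_crash_on_self` itself: since it works by running its input, the call would never settle, which is the toy version of the contradiction in the proof.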
The Halting Problem and Undecidability
That concludes our tour through one of the most sophisticated and profound results in computer science. We have proved the absolute impossibility that anyone will ever write a computer program like CanCrash.exe: a program that analyzes other programs and identifies all possible bugs in those programs that might cause them to crash.
In fact, when Alan Turing, the founder of theoretical computer science, first proved a result like this in the 1930s, he wasn't concerned at all about bugs or crashes. After all, no electronic computer had even been built yet. Instead, Turing was interested in whether or not a given computer program would eventually produce an answer. A closely related question is: will a given computer program ever terminate—or, alternatively, will it go on computing forever, without producing an answer? This question of whether a given computer program will eventually terminate, or “halt,” is known as the Halting Problem. Turing's great achievement was to prove that his variant of the Halting Problem is what computer scientists call “undecidable.” An undecidable problem is one that can't be solved by writing a computer program. So another way of stating Turing's result is: you can't write a computer program called AlwaysHalts.exe, that outputs “yes” if its input always halts, and “no” otherwise.
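To get a feel for why halting is hard to predict, consider the short function below. It is my own example rather than one from the chapter: whether its loop finishes for every positive whole number is the Collatz conjecture, a famous open question in mathematics. An AlwaysHalts.exe, if it could exist, would settle the matter simply by inspecting these few lines.

```python
def collatz_steps(n: int) -> int:
    """Count the steps needed to reach 1 by halving even numbers and mapping odd n to 3n + 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:  # nobody has proved that this loop ends for every n
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(collatz_steps(6))  # 6, 3, 10, 5, 16, 8, 4, 2, 1: eight steps
```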
Viewed in this way, the Halting Problem is very similar to the problem tackled in this chapter, which we might call the Crashing Problem. We proved the undecidability of the Crashing Problem, but you can use essentially the same technique to prove the Halting Problem is also undecidable. And, as you might guess, there are many other problems in computer science that are undecidable.
WHAT ARE THE IMPLICATIONS OF IMPOSSIBLE PROGRAMS?
Except for the conclusion, this is the last chapter in the book. I included it as a deliberate counterpoint to the earlier chapters. Whereas every previous chapter championed a remarkable idea that renders computers even more powerful and useful to us humans, in this chapter we saw one of the fundamental limitations of computers. We saw there are some problems that are literally impossible to solve with a computer, regardless of how powerful the computer is or how clever its human programmer. And these undecidable problems include potentially useful tasks, such as analyzing other computer programs to find out whether they might crash.
What is the significance of this strange, and perhaps even foreboding, fact? Does the existence of undecidable problems affect the way we use computers in practice? And how about the computations that we humans do inside our brains—are those also prevented from tackling undecidable problems?
Undecidability and Computer Use
Let's first address the practical effects of undecidability on computer use. The short answer is: no, undecidability does not have much effect on the daily practice of computing. There are two reasons for this. Firstly, undecidability is concerned only with whether a computer program will ever produce an answer, and does not consider how long we have to wait for that answer. In practice, however, the issue of efficiency (in other words, how long you have to wait for the answer) is extremely important. There are plenty of decidable tasks for which no efficient algorithm is known. The most famous of these is the Traveling Salesman Problem, or TSP for short. Restated in modern terminology, the TSP goes something like this: suppose you have to fly to a large number of cities (say, 20 or 30 or 100). In what order should you visit the cities so as to incur the lowest possible total airfare? As we noted already, this problem is decidable—in fact, a novice programmer with only a few days' experience can write a computer program to find the cheapest route through the cities. The catch is that the program could take millions of years to complete its job. In practice, this isn't good enough. Thus, the mere fact that a problem is decidable does not mean that we can solve it in practice.
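The novice's program might look something like the following sketch, which simply tries every possible order. The four cities, the fares, and the choice to start and finish at a home city are made up for illustration.

```python
from itertools import permutations


def cheapest_round_trip(cities, fare):
    """Try every order of the non-home cities and keep the cheapest round trip."""
    home, *others = cities
    best_cost, best_route = None, None
    for order in permutations(others):
        route = (home, *order, home)
        # Add up the fare of each consecutive leg of the route.
        cost = sum(fare[frozenset(leg)] for leg in zip(route, route[1:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_route = cost, route
    return best_cost, best_route


# Hypothetical fares; each price applies in both directions.
FARES = {
    frozenset(pair): price
    for pair, price in {
        ("A", "B"): 10, ("A", "C"): 15, ("A", "D"): 20,
        ("B", "C"): 35, ("B", "D"): 25, ("C", "D"): 30,
    }.items()
}

print(cheapest_round_trip(["A", "B", "C", "D"], FARES))
```

The catch shows up in the loop: with n cities there are (n - 1)! orders to try. Four cities need only 6, but 20 cities need 19!, which is roughly 1.2 × 10^17, and every extra city multiplies the work again.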
Now for the second reason that undecidability has limited practical effects: it turns out that we can often do a good job of solving undecidable problems most of the time. The main example of the current chapter is an excellent illustration of this. We followed an elaborate proof showing that no computer program can ever be capable of finding all the bugs in all computer programs. But we can still try to write a crash-finding program, hoping to make it find most of the bugs in most types of computer programs. This is, indeed, a very active area of research in computer science. The improvements we've seen in software reliability over the last few decades are partly due to the advances made in crash-finding programs. Thus, it is often possible to produce very useful partial solutions to undecidable problems.
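A tiny example of such a partial solution is sketched below. It is a toy of my own, far simpler than any real crash-finding tool: it reads Python source code without running it, using the standard `ast` module, and reports the lines that divide by a literal zero.

```python
import ast


def literal_zero_divisions(source: str) -> list[int]:
    """Return the line numbers where the code divides by a literal 0."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.BinOp)
            and isinstance(node.op, (ast.Div, ast.FloorDiv))
            and isinstance(node.right, ast.Constant)
            and isinstance(node.right.value, (int, float))
            and node.right.value == 0
        ):
            found.append(node.lineno)
    return sorted(found)


SAMPLE = "a = 1 / 0\nb = 2 / 3\nz = 0\nc = 1 / z\n"
print(literal_zero_divisions(SAMPLE))  # reports line 1 only
```

The checker catches the blatant mistake on line 1 but misses the one on line 4, where the zero is hidden in a variable. That is the nature of a partial solution: it can be made to catch more and more cases, but, by the result of this chapter, never all of them.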
Undecidability and the Brain
Does the existence of undecidable problems have implications for human thought processes? This question leads directly to the murky depths of some classic problems in philosophy, such as the definition of consciousness and the distinction between mind and brain. Nevertheless, we can be clear about one thing: if you believe that the human brain could, in principle, be simulated by a computer, then the brain is subject to the same limitations as computers. In other words, there would be problems that no human brain could solve—however intelligent or well-trained that brain might be. This conclusion follows immediately from the main result in this chapter. If the brain can be imitated by a computer program, and the brain can solve undecidable problems, then we could use a computer simulation of the brain to solve the undecidable problems also—contradicting the fact that computer programs cannot solve undecidable problems.
Of course, the question of whether we will ever be able to perform accurate computer simulations of the brain is far from settled. From a scientific point of view, there do not seem to be any fundamental barriers, since the low-level details of how chemical and electrical signals are transmitted in the brain are reasonably well understood. On the other hand, various philosophical arguments suggest that somehow the physical processes of the brain create a “mind” that is qualitatively different from any physical system that could be simulated by computer. These philosophical arguments take many forms and can be based, for example, on our own capacity for self-reflection and intuition, or an appeal to spirituality.
There is a fascinating connection here to Alan Turing's 1937 paper on undecidability—a paper that is regarded by many as the foundation of computer science as a discipline. The paper's title is, unfortunately, rather obscure: it begins with the innocuous-sounding phrase “On computable numbers…” but ends with the jarring “…with an application to the Entscheidungsproblem.” (We won't be concerning ourselves with the second part of the title here!) It is crucial to realize that in the 1930s, the word “computer” had a completely different meaning, compared to the way we use it today. For Turing, a “computer” was a human, doing some kind of calculation using a pencil and paper. Thus, the “computable numbers” in the title of Turing's paper are the numbers that could, in principle, be calculated by a human. But to assist his argument, Turing describes a particular type of machine (for Turing, a “machine” is what we would call a “computer” today) that can also do calculations. Part of the paper is devoted to demonstrating that certain calculations cannot be performed by these machines—this is the proof of undecidability, which we have discussed in detail already. But another part of the same paper makes a detailed and compelling argument that Turing's “machine” (read: computer) can perform any calculation done by a “computer” (read: human).
You may be beginning to appreciate why it is difficult to overstate the seminal nature of Turing's “On computable numbers…” paper. It not only defines and solves some of the most fundamental problems in computer science, but also strikes out into the heart of a philosophical minefield, making a persuasive case that human thought processes could be emulated by computers (which, remember, had not been invented yet!). In modern philosophical parlance, this notion (that all computers, and probably humans too, have equivalent computational power) is known as the Church-Turing thesis. The name acknowledges both Alan Turing and Alonzo Church, who (as mentioned earlier) independently discovered the existence of undecidable problems. In fact, Church published his work a few months before Turing, but Church's formulation is more abstract and does not explicitly mention computation by machines.
The debate over the validity of the Church-Turing thesis rages on. But if its strongest version holds, then our computers aren't the only ones humbled by the limits of undecidability. The same limits would apply not only to the genius at our fingertips, but also to the genius behind them: our own minds.
11
Conclusion: More Genius at Your Fingertips?
—ALAN TURING, Computing Machinery and Intelligence, 1950
I was fortunate, in 1991, to attend a public lecture by the great theoretical physicist Stephen Hawking. During the lecture, which was boldly titled “The Future of the Universe,” Hawking confidently predicted that the universe would keep expanding for at least the next 10 billion years. He wryly added, “I don't expect to be around to be proved wrong.” Unfortunately for me, predictions about computer science do not come with the same 10-billion-year insurance policy that is available to cosmologists. Any predictions I make may well be disproved during my own lifetime.
But that shouldn't stop us thinking about the future of the great ideas of computer science. Will the great algorithms we've explored remain “great” forever? Will some become obsolete? Will new great algorithms emerge? To address these questions, we need to think less like a cosmologist and more like a historian. This brings to mind another experience I had many years ago, watching some televised lectures by the acclaimed, if controversial, Oxford historian A. J. P. Taylor. At the end of the lecture series, Taylor directly addressed the question of whether there would ever be a third world war. He thought the answer was yes, because humans would probably “behave in the future as they have done in the past.”
So let's follow A. J. P. Taylor's lead and bow to the broad sweep of history. The great algorithms described in this book arose from incidents and inventions sprinkled throughout the 20th century. It seems reasonable to assume a similar pace for the 21st century, with a major new set of algorithms coming to the fore every two or three decades. In some cases, these algorithms could be stunningly original, completely new techniques dreamed up by scientists. Public key cryptography and the related digital signature algorithms are examples of this. In other cases, the algorithms may have existed in the research community for some time, waiting in the wings for the right wave of new technology to give them wide applicability. The search algorithms for indexing and ranking fall into this category: similar algorithms had existed for years in the field known as information retrieval, but it took the phenomenon of web search to make these algorithms “great,” in the sense of daily use by ordinary computer users. Of course, the algorithms also evolved for their new application; PageRank is a good example of this.
Note that the emergence of new technology does not necessarily lead to new algorithms. Consider the phenomenal growth of laptop computers over the 1980s and 1990s. Laptops revolutionized the way people use computers, by vastly increasing accessibility and portability. And laptops also motivated hugely important advances in such diverse areas as screen technology and power management techniques. But I would argue that no great algorithms emerged from the laptop revolution. In contrast, the emergence of the internet is a technology that did lead to great algorithms: by providing an infrastructure in which search engines could exist, the internet allowed indexing and ranking algorithms to evolve toward greatness.
Therefore, the undoubted acceleration of technology growth that continues to unfold around us does not, in and of itself, guarantee the emergence of new great algorithms. In fact, there is a powerful historical force operating in the other direction, suggesting that the pace of algorithmic innovation will, if anything, decrease in the future. I'm referring to the fact that computer science is beginning to mature as a scientific discipline. Compared to fields such as physics, mathematics, and chemistry, computer science is very young: it has its beginnings in the 1930s. Arguably, therefore, the great algorithms discovered in the 20th century may have consisted of low-hanging fruit, and it will become more and more difficult to find ingenious, widely applicable algorithms in the future.
So we have two competing effects: new niches provided by new technology occasionally provide scope for new algorithms, while the increasing maturity of the field narrows the opportunities. On balance, I tend to think that these two effects will cancel each other out, leading to a slow but steady emergence of new great algorithms in the years ahead.
SOME POTENTIALLY GREAT ALGORITHMS
Of course, some of these new algorithms will be completely unexpected, and it's impossible to say anything more about them here. But there are existing niches and techniques that have clear potential. One of the obvious trends is the increasing use of artificial intelligence (and, in particular, pattern recognition) in everyday contexts, and it will be fascinating to see if any strikingly novel algorithmic gems emerge in this area.
Another fertile area is a class of algorithms known as “zero-knowledge protocols.” These protocols use a special type of cryptography to achieve something even more surprising than a digital signature: they let two or more entities combine information without revealing any of the individual pieces of information. One potential application is for online auctions. Using a zero-knowledge protocol, the bidders can cryptographically submit their bids to each other in such a way that the winning bidder is determined, but no information about the other bids is revealed to anyone! Zero-knowledge protocols are such a clever idea that they would easily make it into my canon of great algorithms, if only they were used in practice. But so far, they haven't achieved widespread use.
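The auction example needs heavier machinery than fits here. As a much simpler relative of the same idea (combining inputs without revealing them), the sketch below uses additive secret sharing to compute a total. This is a swapped-in technique chosen for illustration: it is not a zero-knowledge protocol, it is not the auction scheme described above, and the function names and the modulus are assumptions.

```python
import secrets

MODULUS = 2 ** 61 - 1  # assumed to be larger than any total we compute

def split(value, parties):
    """Split `value` into random shares that sum to it (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def private_sum(values):
    """Simulate the exchange: party i sends share j of its value to party j.

    Each party then publishes only the sum of the shares it received,
    so no single published number reveals any individual value.
    """
    n = len(values)
    all_shares = [split(v, n) for v in values]
    published = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                 for j in range(n)]
    return sum(published) % MODULUS

print(private_sum([120, 75, 310]))  # 505
```

Each individual share is a uniformly random number, so it carries no information about the value it came from. Only the combination of all the published sums reveals the total.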
Another idea that has received an immense amount of academic research but limited practical use is a technique known as “distributed hash tables.” These tables are an ingenious way of storing the information in a peer-to-peer system—a system that has no central server directing the flow of information. At the time of writing, however, many of the systems that claim to be peer-to-peer in fact use central servers for some of their functionality and thus do not need to rely on distributed hash tables.
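To give a flavor of how keys can be assigned to peers without a central server, here is a minimal sketch of consistent hashing, an idea used by several distributed hash table designs. It is a toy under stated assumptions: the peer names and the key are made up, the class name `TinyRing` is hypothetical, and a real system would also need routing between peers, replication, and handling of peers that join or leave.

```python
import hashlib
from bisect import bisect_left

def ring_position(name, bits=32):
    """Map a string to a point on a ring of size 2**bits using SHA-256."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class TinyRing:
    """Toy key-to-peer assignment; not a full distributed hash table."""

    def __init__(self, peers):
        self.points = sorted((ring_position(p), p) for p in peers)

    def owner(self, key):
        """Return the first peer at or after the key's ring position."""
        positions = [pos for pos, _ in self.points]
        i = bisect_left(positions, ring_position(key))
        return self.points[i % len(self.points)][1]  # wrap around the ring

ring = TinyRing(["peer-a", "peer-b", "peer-c"])
print(ring.owner("song.mp3"))
```

Every peer can compute `owner` on its own, so no central directory is needed. A key also keeps its owner when some other peer leaves the ring, which limits how much data has to move.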
The technique of “Byzantine fault tolerance” falls in the same category: a surprising and beautiful algorithm that can't yet be classed as great, due to lack of adoption. Byzantine fault tolerance allows certain computer systems to tolerate any type of error whatsoever (as long as there are not too many simultaneous errors). This contrasts with the more usual notion of fault tolerance, in which a system can survive more benign errors, such as the permanent failure of a disk drive or an operating system crash.
CAN GREAT ALGORITHMS FADE AWAY?
In addition to speculating about what algorithms might rise to greatness in the future, we might wonder whether any of our current “great” algorithms—indispensable tools that we use constantly without even thinking about it—might fade in importance. History can guide us here, too. If we restrict attention to particular algorithms, it is certainly true that algorithms can lose relevance. The most obvious example is in cryptography, in which there is a constant arms race between researchers inventing new crypto algorithms, and other researchers inventing ways to crack the security of those algorithms. As a specific instance, consider the so-called cryptographic hash functions. The hash function known as MD5 is an official internet standard and has been widely used since the early 1990s, yet significant security flaws have been found in MD5 since then, and its use is no longer recommended. Similarly, we discussed in chapter 9 the fact that the RSA digital signature scheme will be easy to crack if and when it becomes possible to build quantum computers of a reasonable size.
However, I think examples like this answer our question too narrowly. Sure, MD5 is broken (and, by the way, so is its main successor, SHA-1), but that doesn't mean the central idea of cryptographic hash functions is irrelevant. Indeed, such hash functions are used extremely widely, and there are plenty of uncracked ones out there. So, provided we take a broad enough view of the situation and are prepared to adapt the specifics of an algorithm while retaining its main ideas, it seems unlikely that many of our presently great algorithms will lose their importance in the future.
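The point about keeping the idea while swapping the specifics can be seen in a short sketch using Python's standard `hashlib` module. The helper name `fingerprint` and the sample messages are assumptions made for the example. The specific hash function is just a parameter, so it can be replaced without changing how the fingerprint is used.

```python
import hashlib

def fingerprint(message, algorithm="sha256"):
    """Return a hex digest of `message` using the named hash function."""
    return hashlib.new(algorithm, message.encode("utf-8")).hexdigest()

a = fingerprint("pay Alice $100")
b = fingerprint("pay Alice $900")               # one character changed
c = fingerprint("pay Alice $100", "sha3_256")   # different function, same idea

print(a)
print(b)
print(c)
```

Changing a single character of the message gives an unrelated-looking digest, and switching from SHA-256 to SHA3-256 changes the digest but not the way it is used.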
WHAT HAVE WE LEARNED?
Are there any common themes that can be drawn out from the great algorithms presented here? One theme, which was a great surprise to me as the author of the book, is that all of the big ideas can be explained without requiring previous knowledge of computer programming or any other computer science. When I started work on the book, I assumed that the great algorithms would fall into two categories. The first category would be algorithms with some simple yet clever trick at their core—a trick that could be explained without requiring any technical knowledge. The second category would be algorithms that depended so intimately on advanced computer science ideas that they could not be explained to readers with no background in this area. I planned to include algorithms from this second category by giving some (hopefully) interesting historical anecdotes about the algorithms, explaining their important applications, and vociferously asserting that the algorithm was ingenious even though I couldn't explain how it worked. Imagine my surprise and delight when I discovered that all the chosen algorithms fell into the first category! To be sure, many important technical details were omitted, but in every case, the key mechanism that makes the whole thing work could be explained using nonspecialist notions.
Another important theme common to all our algorithms is that the field of computer science is much more than just programming. Whenever I teach an introductory computer science course, I ask the students to tell me what they think computer science actually is. By far the most common response is “programming,” or something equivalent such as “software engineering.” When pressed to provide additional aspects of computer science, many are stumped. But a common follow-up is something related to hardware, such as “hardware design.” This is strong evidence of a popular misconception about what computer scientists really do. Having read this book, I hope you have a much more concrete idea of the problems that computer scientists spend their time thinking about, and the types of solutions they come up with.
A simple analogy will help here. Suppose you meet a professor whose main research interest is Japanese literature. It is extremely likely that the professor can speak, read, and write Japanese. But if you were asked to guess what the professor spends the most time thinking about while conducting research, you would not guess “the Japanese language.” Rather, the Japanese language is a necessary piece of knowledge for studying the themes, culture, and history that comprise Japanese literature. On the other hand, someone who speaks perfect Japanese might be perfectly ignorant of Japanese literature (there are probably millions of such people in Japan).
The relationship between computer programming languages and the main ideas of computer science is quite similar. To implement and experiment with algorithms, computer science researchers need to convert the algorithms into computer programs, and each program is written in a programming language, such as Java, C++, or Python. Thus, knowledge of a programming language is essential for computer scientists, but it is merely a prerequisite: the main challenge is to invent, adapt, and understand algorithms. After seeing the great algorithms in this book, it is my hope that readers will have a much firmer grasp of this distinction.
THE END OF OUR TOUR
We've reached the end of our tour through the world of profound, yet everyday, computation. Did we achieve our goals? And will your interactions with computing devices be any different as a result?
Well, it's just possible that next time you visit a secure website, you'll be intrigued to know who vouched for its trustworthiness, and check out the chain of digital certificates that were inspected by your web browser (chapter 9). Or perhaps the next time an online transaction fails for an inexplicable reason, you'll be grateful instead of frustrated, knowing that database consistency should ensure you won't be charged for something you failed to order (chapter 8). Or maybe you'll be musing to yourself one day, “Now wouldn't it be nice if my computer could do this for me,” only to realize that it's an impossibility, because your wished-for task can be proved undecidable using the same method as for our crash-finding program (chapter 10).
I'm sure you can think of plenty more examples in which knowledge of the great algorithms might change the way you interact with a computer. However, as I was careful to state in the introduction, this wasn't the primary objective of the book. My chief goal was to give readers enough knowledge about the great algorithms that they gain a sense of wonder at some of their ordinary computing tasks—much as an amateur astronomer has a heightened appreciation of the night sky.
Only you, the reader, can know whether I succeeded in this goal. But one thing is certain: your own personal genius is right at your fingertips. Feel free to use it.
ACKNOWLEDGMENTS
—WALT WHITMAN, Song of the Open Road
Many friends, colleagues, and family members read some or all of the manuscript. Among them are Alex Bates, Wilson Bell, Mike Burrows, Walt Chromiak, Michael Isard, Alastair MacCormick, Raewyn MacCormick, Nicoletta Marini-Maio, Frank McSherry, Kristine Mitchell, Ilya Mironov, Wendy Pollack, Judith Potter, Cotten Seiler, Helen Takacs, Kunal Talwar, Tim Wahls, Jonathan Waller, Udi Wieder, and Ollie Williams. Suggestions from these readers resulted in a large number of substantial improvements to the manuscript. The comments of two anonymous reviewers also resulted in significant improvements.
Chris Bishop provided encouragement and advice. Tom Mitchell gave permission to use his pictures and source code in chapter 6.
Vickie Kearn (the book's editor) and her colleagues at Princeton University Press did a wonderful job of incubating the project and bringing it to fruition.
My colleagues in the Department of Mathematics and Computer Science at Dickinson College were a constant source of support and camaraderie.
Michael Isard and Mike Burrows showed me the joy and beauty of computing. Andrew Blake taught me how to be a better scientist.
My wife Kristine was always there and is here still; much unseen is also here.
To all these people I express my deepest gratitude. The book is dedicated, with love, to Kristine.
SOURCES AND FURTHER READING
As explained on page 8, this book does not use in-text citations. Instead, all sources are listed below, together with suggestions of further reading for those interested in finding out more about the great algorithms of computer science.
The epigraph is from Vannevar Bush's essay “As We May Think,” originally published in the July 1945 issue of The Atlantic magazine.
Introduction (chapter 1). For some accessible, enlightening explanations of algorithms and other computer technology, I recommend Chris Bishop's 2008 Royal Institution Christmas lectures, videos of which are freely available online. The lectures assume no prior knowledge of computer science. A. K. Dewdney's New Turing Omnibus usefully amplifies several of the topics covered in the present volume and introduces many more interesting computer science concepts, but some knowledge of computer programming is probably required to fully appreciate this book. Juraj Hromkovič's Algorithmic Adventures is an excellent option for readers with a little mathematical background, but no knowledge of computer science. Among the many college-level computer science texts on algorithms, three particularly readable options are Algorithms, by Dasgupta, Papadimitriou, and Vazirani; Algorithmics: The Spirit of Computing, by Harel and Feldman; and Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein.
Search engine indexing (chapter 2). The original AltaVista patent covering the metaword trick is U.S. patent 6105019, “Constrained Searching of an Index,” by Mike Burrows (2000). For readers with a computer science background, Search Engines: Information Retrieval in Practice, by Croft, Metzler, and Strohman, is a good option for learning more about indexing and many other aspects of search engines.
PageRank (chapter 3). The opening quotation by Larry Page is taken from an interview by Ben Elgin, published in Businessweek, May 3, 2004. Vannevar Bush's “As We May Think” was, as mentioned above, originally published in The Atlantic magazine (July 1945). Bishop's lectures (see above) contain an elegant demonstration of PageRank using a system of water pipes to emulate hyperlinks. The original paper describing Google's architecture is “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” written by Google's co-founders, Sergey Brin and Larry Page, and presented at the 1998 World Wide Web conference. The paper includes a brief description and analysis of PageRank. A much more technical, wide-ranging analysis appears in Langville and Meyer's Google's PageRank and Beyond—but this book requires college-level linear algebra. John Battelle's The Search begins with an accessible and interesting history of the web search industry, including the rise of Google. The web spam mentioned on page 36 is discussed in “Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages,” by Fetterly, Manasse, and Najork, and published in the 2004 WebDB conference.
Public key cryptography (chapter 4). Simon Singh's The Code Book contains superb, accessible descriptions of many aspects of cryptography, including public key. It also recounts in detail the story of the secret discovery of public key cryptography at GCHQ in Britain. Bishop's lectures (see above) contain a clever practical demonstration of the paint-mixing analogy for public key crypto.
Error correcting codes (chapter 5). The anecdotes about Hamming are documented in Thomas M. Thompson's From Error-Correcting Codes through Sphere Packings to Simple Groups. The quotation from Hamming on page 60 is also given in this book and derives from a 1977 interview of Hamming by Thompson. Mathematicians will greatly enjoy Thompson's delightful book, but it definitely assumes the reader has a healthy dose of college math. Dewdney's book (see above) has two interesting chapters on coding theory. The two quotations about Shannon on pages 77-78 are taken from a brief biography by N. J. A. Sloane and A. D. Wyner, appearing in Claude Shannon: Collected Papers edited by Sloane and Wyner (1993).
Pattern recognition (chapter 6). Bishop's lectures (see above) have some interesting material that nicely complements this chapter. The geographical data about political donations is taken from the Fundrace project of the Huffington Post. All the handwritten digit data is taken from a dataset provided by Yann LeCun, of New York University's Courant Institute, and his collaborators. Details of the dataset, which is known as the MNIST data, are discussed in the 1998 paper by LeCun et al., “Gradient-Based Learning Applied to Document Recognition.” The web spam results come from Ntoulas et al., “Detecting Spam Web Pages through Content Analysis,” published in the Proceedings of the World Wide Web Conference, 2006. The face database was created in the 1990s by a leading pattern recognition researcher, Tom Mitchell of Carnegie Mellon University. Mitchell has used this database in his classes at Carnegie Mellon and describes it in his influential book, Machine Learning. On the website accompanying his book, Mitchell provides a computer program to perform training and classification of neural networks on the face database. All the results for the sunglasses problem were generated using slightly modified versions of this program. Daniel Crevier gives an interesting account of the Dartmouth AI conference in AI: The Tumultuous History of the Search for Artificial Intelligence. The excerpt from the conference's funding proposal (on page 103) is quoted in Pamela McCorduck's 1979 book, Machines Who Think.
压缩(第七章)。关于法诺、香农以及霍夫曼编码的发现的故事取材于亚瑟·诺伯格在1989年对法诺的采访。该采访内容可从查尔斯·巴贝奇研究所的口述历史档案中找到。我最喜欢的数据压缩处理方法是大卫·麦凯的《信息理论、推理和学习算法》,但这本书需要大学数学水平。杜德尼的书(见上文)包含更简洁、更易懂的讨论。
Compression (chapter 7). The story about Fano, Shannon, and the discovery of Huffman coding is taken from a 1989 interview of Fano by Arthur Norberg. The interview is available from the oral history archive of the Charles Babbage Institute. My favorite treatment of data compression is in Information Theory, Inference, and Learning Algorithms, by David MacKay, but this book requires college-level math. Dewdney's book (see above) contains a much briefer and more accessible discussion.
数据库(第 8 章)。市面上有大量书籍为初学者提供数据库入门知识,但它们通常只讲解如何使用数据库,而不是解释数据库的工作原理——而这正是本章的目的。即使是大学水平的教科书也往往侧重于数据库的使用。Garcia-Molina、Ullman 和 Widom 合著的《数据库系统》后半部分是一个例外,它对本章涵盖的主题进行了大量的详细说明。
Databases (chapter 8). There is an over-abundance of books providing an introduction to databases for beginners, but they typically explain how to use databases, rather than explaining how databases work—which was the objective of this chapter. Even college-level textbooks tend to focus on the use of databases. One exception is the second half of Database Systems, by Garcia-Molina, Ullman, and Widom, which gives plenty of details on the topics covered in this chapter.
数字签名(第 9 章)。Gail Grant 的《理解数字签名》一书提供了大量有关数字签名的信息,即使没有计算机科学背景的人也能轻松理解。
Digital signatures (chapter 9). Gail Grant's Understanding Digital Signatures provides a great deal of information about digital signatures and is reasonably accessible to those without a computer science background.
可计算性(第 10 章)。本章开头的引文来自理查德·费曼 1959 年 12 月 29 日在加州理工学院的一次演讲。演讲题目为“底部还有充足的空间”,后来发表在加州理工学院的《工程与科学》杂志(1960 年 2 月)上。关于可计算性和不可判定性概念的一个非传统但非常有趣的呈现方式,是以(虚构的)小说形式呈现的:《图灵(一部关于计算的小说)》,作者是克里斯托斯·帕帕迪米特里奥。
Computability (chapter 10). The chapter's opening quotation is from a talk given by Richard Feynman at Caltech on December 29, 1959. The title of the talk is “There's Plenty of Room at the Bottom,” and it was later published in Caltech's Engineering & Science magazine (February 1960). One unconventional, but very interesting, presentation of the concepts surrounding computability and undecidability is in the form of a (fictional) novel: Turing (A Novel about Computation), by Christos Papadimitriou.
结论(第11章)。史蒂芬·霍金的讲座“宇宙的未来”是1991年在剑桥大学举行的达尔文讲座,也收录在霍金的著作《黑洞与婴儿宇宙》中。AJP ·泰勒电视系列讲座的标题是《战争如何开始》,并于1977年出版成书。
Conclusion (chapter 11). The Stephen Hawking lecture, “The Future of the Universe,” was the 1991 Darwin lecture given at the University of Cambridge, also reprinted in Hawking's book Black Holes and Baby Universes. The televised A. J. P. Taylor lecture series was entitled How Wars Begin, and was also published as a book in 1977.
索引
INDEX
此书印刷版中的索引与您的电子书页面不匹配。请使用您电子阅读设备上的搜索功能搜索您感兴趣的术语。以下列出了印刷版索引中出现的术语,供您参考。
The index that appeared in the print version of this title does not match the pages in your eBook. Please use the search function on your eReading device to search for terms of interest. For your reference, the terms that appear in the print index are listed below.
AAC
AAC
中止。参见事务
abort. See transaction
加法算法
addition algorithm
加法技巧
addition trick
伦纳德·阿德尔曼
Adleman, Leonard
高级加密标准
Advanced Encryption Standard
广告
advertisement
请参阅高级加密标准(AES)。
AES. See Advanced Encryption Standard
AI。参见人工智能
AI. See artificial intelligence
算法:相关书籍;伟大标准;定义;未来;缺乏;与编程的关系;重要性。另请参阅加法算法;校验和;压缩;数字签名;纠错码;Dijkstra 最短路径算法;欧几里得算法;因式分解;JPEG;密钥交换;LZ77;匹配;九种算法;PageRank;公钥;排名;RSA;网页搜索
algorithm: books on; criteria for greatness; definition of; future of; lack of; relationship to programming; significance of. See also addition algorithm; checksum; compression; digital signature; error-correcting code; Dijkstra's shortest-path algorithm; Euclid's algorithm; factorization; JPEG; key exchange; LZ77; matching; nine algorithms; PageRank; public key; ranking; RSA; web search
AltaVista
AltaVista
AlwaysYes.exe
AlwaysYes.exe
亚马逊
Amazon
分析机
Analytical Engine
反崩溃自启动程序
AntiCrashOnSelf.exe
AntiYesOnSelf.exe
AntiYesOnSelf.exe
苹果
Apple
制品。参见压缩
artifact. See compression
人工智能。另请参阅模式识别
artificial intelligence. See also pattern recognition
人工神经网络。参见神经网络
artificial neural network. See neural network
《诚如我们所想》
As We May Think
天文学
astronomy
大西洋杂志
Atlantic magazine
原子性。参见事务
atomic. See transaction
音频。另请参阅压缩
audio. See also compression
奥斯汀,简
Austen, Jane
验证
authentication
权威性:网页的分数。另请参阅认证机构
authority: score; of a web page. See also certification authority
权威伎俩
authority trick
B树
B-tree
巴比伦尼亚
Babylonia
备份
backup
银行;账号;余额;密钥;网上银行;签名;转账;作为可信第三方
bank; account number; balance; for keys; online banking; for signatures; transfer; as trusted third party
指数基数
base, in exponentiation
巴特尔,约翰
Battelle, John
贝尔电话公司
Bell Telephone Company
二进制
binary
必应
Bing
生物学
biology
生物识别传感器
biometric sensor
克里斯托弗·毕晓普
Bishop, Christopher
比特
bit
分组密码
block cipher
网页主体
body, of a web page
脑
brain
布林、谢尔盖
Brin, Sergey
英国政府
British government
浏览器
browser
暴力破解
brute force
漏洞
bug
伯罗斯,迈克
Burrows, Mike
布什,万尼瓦尔
Bush, Vannevar
《商业周刊》
Businessweek
拜占庭容错
Byzantine fault tolerance
C++编程语言
C++ programming language
CA。参见认证机构
CA. See certification authority
微积分
calculus
加州理工学院
Caltech
剑桥
Cambridge
CanCrash 程序
CanCrash.exe
CanCrashWeird.exe
CanCrashWeird.exe
卡内基梅隆大学
Carnegie Mellon University
光盘
CD
手机。参见电话
cell phone. See phone
证书
certificate
认证机构
certification authority
查尔斯·巴贝奇研究所
Charles Babbage Institute
聊天机器人
chat-bot
支票簿
checkbook
校验和;实践中;简单;阶梯式。另请参阅加密散列函数
checksum; in practice; simple; staircase. See also cryptographic hash function
校验和技巧
checksum trick
化学
chemistry
棋
chess
阿隆佐·丘奇
Church, Alonzo
丘奇-图灵论题
Church-Turing thesis
引用
citations
类别
class
分类
classification
分类器
classifier
时钟运算
clock arithmetic
时钟大小;条件;分解;需要大;主要;作为公共数字;在RSA中;次要
clock size; conditions on; factorization of; need for large; primary; as a public number; in RSA; secondary
科德,EF
Codd, E. F.
代码字
code word
提交阶段
commit phase
光盘。参见CD
compact disk. See CD
压缩;通过 AAC(参见 AAC);失真;音频或音乐;历史;图像;通过 JPEG(参见 JPEG);无损;有损;通过 MP3(参见 MP3);与纠错码的关系;用途;视频
compression; via AAC (see AAC); artifact; of audio or music; history of; of images; via JPEG (see JPEG); lossless; lossy; via MP3 (see MP3); relation to error-correcting code; uses of; of video
可计算:数字;问题。另请参阅不可计算
computable: number; problem. See also uncomputable
计算机:20 世纪 80 年代和 90 年代;对精度的要求;对经典的欣赏;与人类相比;早期;错误(另见纠错码;错误检测);第一个电子计算机;人类计算器的基本工作;智能(见人工智能);笔记本电脑(见笔记本电脑);限制;机械;现代;量子(见量子计算);路由器(见路由器);服务器(见服务器);用户。另见硬件
computer: 1980s and ’90s; accuracy requirement of; appreciation of; classical; compared to humans; early; error (see also error-correcting code; error detection); first electronic; fundamental jobs of; human calculator; intelligent (see artificial intelligence); laptop (see laptop); limitations on; mechanical; modern; quantum (see quantum computing); router (see router); server (see server); users. See also hardware
计算机程序;分析另一个程序;可执行;不可能;输入和输出;智能;程序员;编程;编程语言;验证;世界上第一个程序员;是非
computer program; analyzing another program; executable; impossible; input and output; intelligent; programmers; programming; programming languages; verification; world's first programmer; yes-no
计算机编程。参见计算机程序
computer programming. See computer program
计算机科学;美;确定性;课程;创立;在高中;入门教学;普及度;预测;公众对……的看法;研究;在社会中;理论;不可判定问题
computer science; beauty in; certainty in; curriculum; founding of; in high school; introductory teaching; popularity of; predictions about; public perception of; research; in society; theory of; undecidable problems in
《计算机器与智能》
Computing Machinery and Intelligence
并发
concurrency
意识
consciousness
一致性。另请参阅不一致性
consistency. See also inconsistency
矛盾。参见矛盾证明
contradiction. See proof by contradiction
托马斯·科尔门
Cormen, Thomas
宇宙学
cosmology
圣约女人
Covenant Woman
中央处理器
CPU
崩溃;故意
crash; intentional
崩溃问题
Crashing Problem
CrashOnSelf.exe
CrashOnSelf.exe
CRC32
CRC32
信用卡
credit card
丹尼尔·克雷维尔
Crevier, Daniel
克罗夫特,布鲁斯
Croft, Bruce
加密哈希函数
cryptographic hash function
密码学;公钥(参见公钥密码学)
cryptography; public key (see public key cryptography)
楔形文字
cuneiform
循环
cycle
达特茅斯人工智能会议
Dartmouth AI conference
达斯古普塔,桑乔伊
Dasgupta, Sanjoy
数据中心
data center
数据库;列;定义;地理复制;面;关系;复制;行;表。另请参阅虚拟表
database; column; definition of; geographically replicated; of faces; relational; replicated; row; table. See also virtual table
死锁
deadlock
决策树
decision tree
解密
decrypt
深蓝
Deep Blue
民主党人
Democrat
杜德尼,A. K.
Dewdney, A. K.
查尔斯·狄更斯
Dickens, Charles
迪菲·惠特菲尔德
Diffie, Whitfield
Diffie-Hellman。请参阅密钥交换
Diffie-Hellman. See key exchange
数字签名;应用;与密码学的联系;检测伪造;长消息;实践;安全性。另请参阅RSA;签名
digital signature; applications of; connection to cryptography; detect forgery of; of long messages; in practice; security of. See also RSA; signature
Dijkstra 最短路径算法
Dijkstra's shortest-path algorithm
离散指数运算
discrete exponentiation
离散对数
discrete logarithm
磁盘。参见硬盘
disk. See hard disk
分布式哈希表
distributed hash table
双击
double-click
柯南·道尔
Doyle, Arthur Conan
驱动器。参见硬盘
drive. See hard disk
DVD
DVD
鲍勃·迪伦
Dylan, Bob
易趣
eBay
电子商务
e-commerce
埃尔金,本
Elgin, Ben
电子邮件
e-mail
艾玛
Emma
加密;128位加密
encrypt; 128-bit encryption
工程
engineering
判定问题
Entscheidungsproblem
错误检测
error detection
纠错码;与压缩的关系
error-correcting code; relation to compression
《人类理解论》
Essay Concerning Human Understanding
以太网
Ethernet
欧几里得
Euclid
欧几里得算法
Euclid's algorithm
兴奋性
excitatory
指数
exponent
幂运算。另请参阅离散幂运算;幂符号
exponentiation. See also discrete exponentiation; power notation
扩展名。请参阅文件扩展名
extension. See file name extension
人脸数据库。请参阅数据库
face database. See database
人脸识别
face recognition
因式分解
factorization
罗伯特·法诺
Fano, Robert
法拉第,迈克尔
Faraday, Michael
容错
fault-tolerance
传真
fax
费尔德曼,伊沙伊
Feldman, Yishai
丹尼斯·费特利
Fetterly, Dennis
理查德·费曼
Feynman, Richard
文件扩展名;取消隐藏
file name extension; unhide
财务信息
financial information
有限域代数
finite field algebra
闪存。参见内存
flash memory. See memory
伪造。另请参阅数字签名
forgery. See also digital signature
冻结
freeze
冻结程序
Freeze.exe
Fundrace项目
Fundrace project
车库
garage
加西亚-莫利纳,赫克托
Garcia-Molina, Hector
英国政府通信总部
GCHQ
遗传学
genetics
GeoTrust
GeoTrust
GlobalSign
GlobalSign
谷歌
Google
格兰特,盖尔
Grant, Gail
格雷,吉姆
Gray, Jim
伟大的美国音乐厅
Great American Music Hall
黑客
hacker
停止。参见终止
halt. See terminate
停机问题
halting problem
理查德·哈明
Hamming, Richard
汉明码
Hamming code
手写识别
handwriting recognition
硬盘;故障;操作;空间
hard disk; failure; operation; space
硬件;故障
hardware; failures of
哈迪,GH
Hardy, G. H.
哈雷尔·大卫
Harel, David
哈希表
hash tables
霍金,斯蒂芬
Hawking, Stephen
草垛
haystack
赫尔曼,马丁
Hellman, Martin
戴夫·休利特
Hewlett, Dave
惠普
Hewlett-Packard
隐藏文件
hidden files
高清
high-definition
命中,用于网络搜索查询
hit, for web search query
福尔摩斯、夏洛克
Holmes, Sherlock
赫罗姆科夫茨,尤拉伊
Hromkovič, Juraj
HTML
HTML
http
http
https
https
赫芬顿邮报
Huffington Post
大卫·霍夫曼
Huffman, David
哈夫曼编码
Huffman coding
超链接;循环(参见循环);传入
hyperlink; cycle of (see cycle); incoming
超链接技巧
hyperlink trick
IBM
IBM
信息和通信技术
ICT
幂等
idempotent
传入链接。请参阅超链接
incoming link. See hyperlink
副本不一致;崩溃后。另请参阅一致性
inconsistency; after a crash; of replicas. See also consistency
索引。另请参阅索引
index. See also indexing
索引;AltaVista 专利;历史;使用元词;与词位置
indexing; AltaVista patent on; history of; using metawords; with word locations
信息检索
information retrieval
信息论
information theory
信息搜索
Infoseek
抑制
inhibitory
保险
insurance
整数分解。参见因式分解
integer factorization. See factorization
互联网;地址;通信方式;公司;协议;标准;浏览
internet; addresses; communication via; companies; protocols; standards; surfing
intitle,网页搜索关键词
intitle, web search keyword
日本人
Japanese
Java编程语言
Java programming language
乔布斯,史蒂夫
Jobs, Steve
连接操作
join operation
JPEG
JPEG
加里·卡斯帕罗夫
Kasparov, Garry
密钥:在密码学中(另请参阅公钥;共享秘密);在数据库中;在数字签名中;物理
key: in cryptography (see also public key; shared secret); in a database; in digital signature; physical
密钥交换;Diffie-Hellman
key exchange; Diffie-Hellman
键盘
keyboard
千字节
kilobyte
K最近邻
K-nearest-neighbors
标记
labeled
朗维尔,艾米 N.
Langville, Amy N.
笔记本电脑
laptop
学习。另请参阅训练
learning. See also training
省略技巧
leave-it-out trick
扬·乐昆
LeCun, Yann
查尔斯·莱瑟森
Leiserson, Charles
亚伯拉罕·伦佩尔
Lempel, Abraham
牌照
license plate
林肯·亚伯拉罕
Lincoln, Abraham
线性代数
linear algebra
链接。请参阅超链接
link. See hyperlink
基于链接的排名。请参阅排名
link-based ranking. See ranking
实时搜索
Live Search
锁:在密码学中;在数据库中
lock: in cryptography; in a database
锁死。参见冻结
lock up. See freeze
锁箱
lockbox
洛克,约翰
Locke, John
对数。另请参阅离散对数
logarithm. See also discrete logarithm
洛斯阿尔托斯
Los Altos
无损压缩。参见压缩
lossless compression. See compression
有损压缩。请参阅压缩
lossy compression. See compression
艾达·洛夫莱斯
Lovelace, Ada
爱的徒劳
Love's Labour's Lost
低密度奇偶校验码
low-density parity-check code
莱科斯
Lycos
LZ77
LZ77
机器学习(书籍)
Machine Learning (book)
机器学习。参见模式识别
machine learning. See pattern recognition
麦凯,大卫
MacKay, David
马克·马纳塞
Manasse, Mark
主服务器。参见副本
master. See replica
匹配
matching
数学家
mathematician
数学家的道歉
Mathematician's Apology, A
数学;古代问题;美;确定性;历史;假装
mathematics; ancient problems in; beauty in; certainty in; history of; pretend
帕梅拉·麦考达克
McCorduck, Pamela
MD5
MD5
药品
medicine
百万像素
megapixel
梅梅克斯
memex
存储器:计算机;闪存
memory: computer; flash
门洛帕克
Menlo Park
元词;在 HTML 中
metaword; in HTML
元词技巧;定义。另请参阅索引
metaword trick; definition of. See also indexing
唐纳德·梅茨勒
Metzler, Donald
迈耶,卡尔 D.
Meyer, Carl D.
微软
Microsoft
微软 Excel
Microsoft Excel
微软办公软件
Microsoft Office
微软研究院
Microsoft Research
微软 Word
Microsoft Word
头脑
mind
麻省理工学院
MIT
米切尔,汤姆
Mitchell, Tom
MNIST
MNIST
手机。参见电话
mobile phone. See phone
显示器
monitor
MP3
MP3
MSN
MSN
乘法挂锁技巧
multiplicative padlock trick
MySpace
MySpace
马克·纳乔克
Najork, Marc
名称大小工具
NameSize.exe
搜索查询中的 NEAR 关键字;用于排名
NEAR keyword in search query; for ranking
最近邻分类器
nearest-neighbor classifier
最近邻技巧
nearest-neighbor trick
Netflix
Netflix
网络:计算机;设备;神经网络(见神经网络);协议;社交(见社交网络)
network: computer; equipment; neural (see neural network); protocol; social (see social network)
神经网络;人工;生物;卷积;太阳镜问题;雨伞问题;训练
neural network; artificial; biological; convolutional; for sunglasses problem; for umbrella problem; training
神经元
neuron
神经科学
neuroscience
纽约
New York
纽约大学
New York University
九种算法
nine algorithms
诺贝尔奖
Nobel Prize
诺伯格,亚瑟
Norberg, Arthur
亚历山德拉·恩图拉斯
Ntoulas, Alexandros
数字混合技巧
number-mixing trick
物体识别
object recognition
单向行动
one-way action
网上银行。参见“银行”
online banking. See bank
网上支付账单
online bill payment
操作系统
operating system
开销
overhead
牛津
Oxford
包
packet
挂锁。请参阅实体挂锁技巧
padlock. See physical padlock trick
页面大小
page size
佩奇,拉里
Page, Larry
PageRank
PageRank
颜料混合技巧
paint-mixing trick
帕洛阿尔托
Palo Alto
帕帕迪米特里奥,克里斯托斯
Papadimitriou, Christos
悖论
paradox
奇偶校验
parity
密码
password
专利
patent
模式识别;应用;与人工智能的联系;失败;历史;人工努力;预处理;判断运用
pattern recognition; applications of; connection to artificial intelligence; failures in; history of; manual effort in; preprocessing in; use of judgment in
《电脑杂志》
PC Magazine
对等系统
peer-to-peer system
哲学
philosophy
电话;账单;号码。另请参阅贝尔电话公司
phone; bill; number. See also Bell Telephone Company
照片
photograph
短语查询
phrase query
物理挂锁技巧
physical padlock trick
物理
physics
精准技巧
pinpoint trick
像素
pixel
明信片
postcard
邮政编码
postcode
电源:电气;故障;升至
power: electrical; failure; raising to a
幂符号。另请参阅指数运算
power notation. See also exponentiation
PPN。请参阅公私号码
PPN. See public-private number
准备阶段
prepare phase
准备然后提交的技巧
prepare-then-commit trick
预处理
preprocessing
素数
prime number
原根
primitive root
私人色彩
private color
私人号码
private number
概率。另请参阅重启概率
probability. See also restart probability
程序。参见计算机程序
program. See computer program
程序A.exe
ProgramA.exe
程序B.exe
ProgramB.exe
编程。参见计算机程序
programming. See computer program
投影运算
projection operation
反证法
proof by contradiction
公众色彩
public color
公钥
public key
公钥密码学;与数字签名的联系。另请参阅密码学
public key cryptography; connection to digital signatures. See also cryptography
公众号
public number
公私混合
public-private mixture
公私号码
public-private number
脉搏率
pulse rate
纯的
pure
Python编程语言
Python programming language
量子计算
quantum computing
量子力学
quantum mechanics
快速排序
quicksort
随机冲浪技巧
random surfer trick
排名;基于链接;以及接近度。另请参阅PageRank
ranking; link-based; and nearness. See also PageRank
重启
reboot
冗余
redundancy
冗余技巧
redundancy trick
里德·欧文
Reed, Irving
里德-所罗门码
Reed-Solomon code
关系代数
relational algebra
关系数据库。请参阅数据库
relational database. See database
关联
relevance
重复技巧
repetition trick
复制品;原版
replica; master
复制数据库。请参阅数据库
replicated database. See database
共和党人
Republican
解决
resolution
重启概率
restart probability
右键单击
right-click
罗纳德·里维斯特
Rivest, Ronald
机器人技术
robotics
洛克菲勒基金会
Rockefeller Foundation
回滚。参见事务
roll back. See transaction
根 CA
root CA
圆形的
round
路由器
router
皇家学会圣诞讲座
Royal Institution Christmas Lectures
RSA;因式分解和;量子计算机和;安全性。另请参阅时钟大小
RSA; factoring and; quantum computers and; security of. See also clock size
游程编码
run-length encoding
与之前的技巧相同
same-as-earlier trick
样本
sample
旧金山
San Francisco
卫星
satellite
屏幕。另请参阅显示器
screen. See also monitor
搜索引擎。参见网络搜索
search engine. See web search
扇区大小
sector size
安全通信
secure communication
安全哈希。请参阅加密哈希函数
secure hash. See cryptographic hash function
安全性。另请参阅数字签名;RSA
security. See also digital signature; RSA
选择操作
select operation
服务器;安全
server; secure
SHA
SHA
莎士比亚,威廉
Shakespeare, William
沙米尔·阿迪
Shamir, Adi
克劳德·香农
Shannon, Claude
香农-法诺编码
Shannon-Fano coding
共享秘密;定义;长度
shared secret; definition of; length of
共享秘密混合
shared secret mixture
短符号技巧
shorter-symbol trick
签名:数字(参见数字签名);手写
signature: digital (see digital signature); handwritten
硅谷
Silicon Valley
简单校验和。请参阅校验和
simple checksum. See checksum
模拟:大脑;随机冲浪者
simulation: of the brain; of random surfer
西蒙·辛格
Singh, Simon
尺寸检查器
SizeChecker.exe
斯隆,N. J. A.
Sloane, N. J. A.
智能手机。参见电话
smartphone. See phone
窥探
snoop
社交网络
social network
软件;下载;可靠性;签名
software; download; reliability of; signed
软件工程
software engineering
来源
sources
垃圾邮件。另请参阅网络垃圾邮件
spam. See also web spam
语音识别
speech recognition
灵性
spirituality
电子表格
spreadsheet
SQL
SQL
阶梯校验和。请参阅校验和
staircase checksum. See checksum
斯坦福大学
Stanford University
星际迷航
Star Trek
统计数据
statistics
斯坦,克利福德
Stein, Clifford
随机梯度下降
stochastic gradient descent
特雷弗·斯特罗曼
Strohman, Trevor
结构:在数据中;在网页中。另请参阅数据库,表
structure: in data; in a web page. See also database, table
结构查询
structure query
太阳镜问题。参见神经网络
sunglasses problem. See neural network
超级计算机
supercomputer
支持向量机
support vector machine
冲浪者权威评分
surfer authority score
象征
symbol
表。请参阅数据库、表;虚拟表
table. See database, table; virtual table
标签
tag
双城记
Tale of Two Cities, A
目标值
target value
泰勒,AJP
Taylor, A. J. P.
TCP
TCP
电报
telegraph
电话。参见电话
telephone. See phone
终止
terminate
神学
theology
汤普森,托马斯·M.
Thompson, Thomas M.
阈值;软
threshold; soft
标题:本书的;网页的
title: of this book; of a web page
待办事项清单
to-do list
待办事项清单技巧
to-do list trick
汤姆·索亚
Tom Sawyer
训练。另请参阅学习
training. See also learning
训练数据
training data
交易:中止;原子;在数据库中;在互联网上;回滚
transaction: abort; atomic; in a database; on the internet; rollback
旅行社
travel agent
旅行商问题
Traveling Salesman Problem
技巧,定义
trick, definition of
麻烦制造者
TroubleMaker.exe
图灵,艾伦
Turing, Alan
图灵机
Turing machine
图灵测试
Turing test
电视
TV
马克·吐温
Twain, Mark
二十个问题,游戏
twenty questions, game of
二十个问题技巧
twenty-questions trick
二维奇偶校验。参见奇偶校验
two-dimensional parity. See parity
两阶段提交
two-phase commit
美国内战
U.S. Civil War
厄尔曼,杰弗里·D.
Ullman, Jeffrey D.
不可计算的。另请参阅不可判定的
uncomputable. See also undecidable
不可判定的。另请参阅 不可计算的
undecidable. See also uncomputable
不明确的
undefined
独轮车
unicycle
宇宙
universe
未标记
unlabeled
瓦齐拉尼,乌梅什
Vazirani, Umesh
确认
verification
威瑞信
Verisign
视频
video
电子游戏
video game
虚拟表
virtual table
虚拟表技巧
virtual table trick
沃特斯,爱丽丝
Waters, Alice
网络。参见万维网
web. See World Wide Web
Web 浏览器。参见浏览器
web browser. See browser
网络搜索;算法;引擎;历史;市场份额;实践。另请参阅索引;匹配;PageRank;排名
web search; algorithms for; engine; history of; market share; in practice. See also indexing; matching; PageRank; ranking
Web 服务器。请参阅服务器
web server. See server
网络垃圾邮件
web spam
WebDB 会议
WebDB conference
网站;安全
website; secure
重量
weight
沃尔特·惠特曼
Whitman, Walt
堪萨斯州威奇托
Wichita, Kansas
威多姆,詹妮弗
Widom, Jennifer
运行Word程序
WINWORD.EXE
文字处理器
word processor
单词定位技巧
word-location trick
世界贸易中心
World Trade Center
万维网;会议
World Wide Web; conference
史蒂夫·沃兹尼亚克
Wozniak, Steve
预写日志
write-ahead log
预写日志
write-ahead logging
怀纳,AD
Wyner, A. D.
雅虎
Yahoo
是-否程序。参见计算机程序
yes-no program. See computer program
YesOnSelf.exe
YesOnSelf.exe
零,除以
zero, division by
零知识协议
zero-knowledge protocol
ZIP文件
ZIP file
齐夫,雅各布
Ziv, Jacob